43 research outputs found
Multidisciplinary perspectives on Artificial Intelligence and the law
This open access book presents an interdisciplinary, multi-authored, edited collection of chapters on Artificial Intelligence (‘AI’) and the Law. AI technology has come to play a central role in the modern data economy. Through a combination of increased computing power, the growing availability of data and the advancement of algorithms, AI has now become an umbrella term for some of the most transformational technological breakthroughs of this age. The importance of AI stems from both the opportunities that it offers and the challenges that it entails. While AI applications hold the promise of economic growth and efficiency gains, they also create significant risks and uncertainty. The potential and perils of AI have thus come to dominate modern discussions of technology and ethics – and although AI was initially allowed to largely develop without guidelines or rules, few would deny that the law is set to play a fundamental role in shaping the future of AI. As the debate over AI is far from over, the need for rigorous analysis has never been greater. This book thus brings together contributors from different fields and backgrounds to explore how the law might provide answers to some of the most pressing questions raised by AI. An outcome of the Católica Research Centre for the Future of Law and its interdisciplinary working group on Law and Artificial Intelligence, it includes contributions by leading scholars in the fields of technology, ethics and the law.info:eu-repo/semantics/publishedVersio
Automatic Screening of Childhood Speech Sound Disorders and Detection of Associated Pronunciation Errors
Speech disorders in children can affect their fluency and intelligibility. Delay in their diagnosis and treatment increases the risk of social impairment and learning disabilities. With the significant shortage of Speech and Language Pathologists (SLPs), there is an increasing interest in Computer-Aided Speech Therapy tools with automatic detection and diagnosis capability.
However, the scarcity and unreliable annotation of disordered child speech corpora along with the high acoustic variations in the child speech data has impeded the development of reliable automatic detection and diagnosis of childhood speech sound disorders. Therefore, this thesis investigates two types of detection systems that can be achieved with minimum dependency on annotated mispronounced speech data.
First, a novel approach that adopts paralinguistic features which represent the prosodic, spectral, and voice quality characteristics of the speech was proposed to perform segment- and subject-level classification of Typically Developing (TD) and Speech Sound Disordered (SSD) child speech using a binary Support Vector Machine (SVM) classifier. As paralinguistic features are both language- and content-independent, they can be extracted from an unannotated speech signal.
Second, a novel Mispronunciation Detection and Diagnosis (MDD) approach was introduced to detect the pronunciation errors made due to SSDs and provide low-level diagnostic information that can be used in constructing formative feedback and a detailed diagnostic report. Unlike existing MDD methods where detection and diagnosis are performed at the phoneme level, the proposed method achieved MDD at the speech attribute level, namely the manners and places of articulations. The speech attribute features describe the involved articulators and their interactions when making a speech sound allowing a low-level description of the pronunciation error to be provided. Two novel methods to model speech attributes are further proposed in this thesis, a frame-based (phoneme-alignment) method leveraging the Multi-Task Learning (MTL) criterion and training a separate model for each attribute, and an alignment-free jointly-learnt method based on the Connectionist Temporal Classification (CTC) sequence to sequence criterion.
The proposed techniques have been evaluated using standard and publicly accessible adult and child speech corpora, while the MDD method has been validated using L2 speech corpora
CLARIN
The book provides a comprehensive overview of the Common Language Resources and Technology Infrastructure – CLARIN – for the humanities. It covers a broad range of CLARIN language resources and services, its underlying technological infrastructure, the achievements of national consortia, and challenges that CLARIN will tackle in the future. The book is published 10 years after establishing CLARIN as an Europ. Research Infrastructure Consortium
Simulating realistic multiparty speech data: for the development of distant microphone ASR systems
Automatic speech recognition has become a ubiquitous technology integrated into our daily lives. However, the problem remains challenging when the speaker is far away from the microphone. In such scenarios, the speech is degraded both by reverberation and by the presence of additive noise. This situation is particularly challenging when there are competing speakers present (i.e. multi-party scenarios)
Acoustic scene simulation has been a major tool for training and developing distant microphone speech recognition systems, and is now being used to develop solutions for mult-party scenarios. It has been used both in training -- as it allows cheap generation of limitless amounts of data -- and for evaluation -- because it can provide easy access to a ground truth (i.e. a noise-free target signal). However, whilst much work has been conducted to produce realistic artificial scene simulators, the signals produced from such simulators are only as good as the `metadata' being used to define the setups, i.e., the data describing, for example, the number of speakers and their distribution relative to the microphones.
This thesis looks at how realistic metadata can be derived by analysing how speakers behave in real domestic environments. In particular, how to produce scenes that provide a realistic distribution for various factors that are known to influence the 'difficulty' of the scene, including the separation angle between speakers, the absolute and relative distances of speakers to microphones, and the pattern of temporal overlap of speech. Using an existing audio-visual multi-party conversational dataset, CHiME-5, each of these aspects has been studied in turn.
First, producing a realistic angular separation between speakers allows for algorithms which enhance signals based on the direction of arrival to be fairly evaluated, reducing the mismatch between real and simulated data. This was estimated using automatic people detection techniques in video recordings from CHiME-5. Results show that commonly used datasets of simulated signals do not follow a realistic distribution, and when a realistic distribution is enforced, a significant drop in performance is observed.
Second, by using multiple cameras it has been possible to estimate the 2-D positions of people inside each scene. This has allowed the estimation of realistic distributions for the absolute distance to the microphone and relative distance to the competing speaker. The results show grouping behaviour among participants when located in a room and the impact this has on performance depends on the room size considered.
Finally, the amount of overlap and points in the mixture which contain overlap were explored using finite-state models. These models allowed for mixtures to be generated, which approached the overlap patterns observed in the real data. Features derived from these models were also shown to be a predictor of the difficulty of the mixture.
At each stage of the project, simulated datasets derived using the realistic metadata distributions have been compared to existing standard datasets that use naive or uninformed metadata distributions, and implications for speech recognition performance are observed and discussed. This work has demonstrated how unrealistic approaches can produce over-promising results, and can bias research towards techniques that might not work well in practice. Results will also be valuable in informing the design of future simulated datasets
Computational Intelligence and Human- Computer Interaction: Modern Methods and Applications
The present book contains all of the articles that were accepted and published in the Special Issue of MDPI’s journal Mathematics titled "Computational Intelligence and Human–Computer Interaction: Modern Methods and Applications". This Special Issue covered a wide range of topics connected to the theory and application of different computational intelligence techniques to the domain of human–computer interaction, such as automatic speech recognition, speech processing and analysis, virtual reality, emotion-aware applications, digital storytelling, natural language processing, smart cars and devices, and online learning. We hope that this book will be interesting and useful for those working in various areas of artificial intelligence, human–computer interaction, and software engineering as well as for those who are interested in how these domains are connected in real-life situations
Interaction analytics for automatic assessment of communication quality in primary care
Effective doctor-patient communication is a crucial element of health care, influencing patients’ personal and medical outcomes following the interview.
The set of skills used in interpersonal interaction is complex, involving verbal
and non-verbal behaviour. Precise attributes of good non-verbal behaviour
are difficult to characterise, but models and studies offer insight on relevant
factors. In this PhD, I studied how the attributes of non-verbal behaviour can
be automatically extracted and assessed, focusing on turn-taking patterns of
and the prosody of patient-clinician dialogues.
I described clinician-patient communication and the tools and methods used to
train and assess communication during the consultation. I then proceeded to
a review of the literature on the existing efforts to automate assessment, depicting an emerging domain focused on the semantic content of the exchange
and a lack of investigation on interaction dynamics, notably on the structure of
turns and prosody.
To undertake the study of these aspects, I initially planned the collection of
data. I underlined the need for a system that follows the requirements of sensitive data collection regarding data quality and security. I went on to design a
secure system which records participants’ speech as well as the body posture
of the clinician. I provided an open-source implementation and I supported its
use by the scientific community.
I investigated the automatic extraction and analysis of some non-verbal components of the clinician-patient communication on an existing corpus of GP
consultations. I outlined different patterns in the clinician-patient interaction
and I further developed explanations of known consulting behaviours, such as
the general imbalance of the doctor-patient interaction and differences in the
control of the conversation.
I compared behaviours present in face to face, telephone, and video consultations, finding overall similarities alongside noticeable differences in patterns of
overlapping speech and switching behaviour.
I further studied non-verbal signals by analysing speech prosodic features, investigating differences in participants’ behaviour and relations between the assessment of the clinician-patient communication and prosodic features. While
limited in their interpretative power on the explored dataset, these signals
nonetheless provide additional metrics to identify and characterise variations
in the non-verbal behaviour of the participants.
Analysing clinician-patient communication is difficult even for human experts.
Automating that process in this work has been particularly challenging. I demonstrated the capacity of automated processing of non-verbal behaviours to analyse clinician-patient communication. I outlined the ability to explore new aspects, interaction dynamics, and objectively describe how patients and clinicians interact. I further explained known aspects such as clinician dominance
in more detail. I also provided a methodology to characterise participants’ turns
taking behaviour and speech prosody for the objective appraisal of the quality of non-verbal communication. This methodology is aimed at further use in
research and education
CLARIN. The infrastructure for language resources
CLARIN, the "Common Language Resources and Technology Infrastructure", has established itself as a major player in the field of research infrastructures for the humanities. This volume provides a comprehensive overview of the organization, its members, its goals and its functioning, as well as of the tools and resources hosted by the infrastructure. The many contributors representing various fields, from computer science to law to psychology, analyse a wide range of topics, such as the technology behind the CLARIN infrastructure, the use of CLARIN resources in diverse research projects, the achievements of selected national CLARIN consortia, and the challenges that CLARIN has faced and will face in the future.
The book will be published in 2022, 10 years after the establishment of CLARIN as a European Research Infrastructure Consortium by the European Commission (Decision 2012/136/EU)
‘Did the speaker change?’: Temporal tracking for overlapping speaker segmentation in multi-speaker scenarios
Diarization systems are an essential part of many speech processing applications, such as speaker indexing, improving automatic speech recognition (ASR) performance and making single speaker-based algorithms available for use in multi-speaker domains. This thesis will focus on the first task of the diarization process, that being the task of speaker segmentation which can be thought of as trying to answer the question ‘Did the speaker change?’ in an audio recording.
This thesis starts by showing that time-varying pitch properties can be used advantageously within the segmentation step of a multi-talker diarization system. It is then highlighted that an individual’s pitch is smoothly varying and, therefore, can be predicted by means of a Kalman filter. Subsequently, it is shown that if the pitch is not predictable, then this is most likely due to a change in the speaker. Finally, a novel system is proposed that uses this approach of pitch prediction for speaker change detection.
This thesis then goes on to demonstrate how voiced harmonics can be useful in detecting when more than one speaker is talking, such as during overlapping speaker activity. A novel system is proposed to track multiple harmonics simultaneously, allowing for the determination of onsets and end-points of a speaker’s utterance in the presence of an additional active speaker.
This thesis then extends this work to explore the use of a new multimodal approach for overlapping speaker segmentation that tracks both the fundamental frequency (F0) and direction of arrival (DoA) of each speaker simultaneously. The proposed multiple hypothesis tracking system, which simultaneously tracks both features, shows an improvement in segmentation performance when compared to tracking these features separately.
Lastly, this thesis focuses on the DoA estimation part of the newly proposed multimodal approach. It does this by exploring a polynomial extension to the multiple signal classification (MUSIC) algorithm, spatio-spectral polynomial (SSP)-MUSIC, and evaluating its performance when using speech sound sources.Open Acces
Speaker Diarization
Disertační práce se zaměřuje na téma diarizace řečníků, což je úloha zpracování řeči typicky charakterizovaná otázkou "Kdo kdy mluví?". Práce se také zabývá související úlohou detekce překrývající se řeči, která je velmi relevantní pro diarizaci.
Teoretická část práce poskytuje přehled existujících metod diarizace řečníků, a to jak těch offline, tak online, a přibližuje několik problematických oblastí, které byly identifikovány v rané fázi autorčina výzkumu. V práci je také předloženo rozsáhlé srovnání existujících systémů se zaměřením na jejich uváděné výsledky. Jedna kapitola se také zaměřuje na téma překrývající se řeči a na metody její detekce.
Experimentální část práce předkládá praktické výstupy, kterých bylo dosaženo. Experimenty s diarizací se zaměřovaly zejména na online systém založený na GMM a na i-vektorový systém, který měl offline i online varianty. Závěrečná sekce experimentů také přibližuje nově navrženou metodu pro detekci překrývající se řeči, která je založena na konvoluční neuronové síti.ObhájenoThe thesis focuses on the topic of speaker diarization, a speech processing task that is commonly characterized as the question "Who speaks when?". It also addresses the related task of overlapping speech detection, which is very relevant for diarization.
The theoretical part of the thesis provides an overview of existing diarization approaches, both offline and online, and discusses some of the problematic areas which were identified in early stages of the author's research. The thesis also includes an extensive comparison of existing diarization systems, with focus on their reported performance. One chapter is also dedicated to the topic of overlapping speech and the methods of its detection.
The experimental part of the thesis then presents the work which has been done on speaker diarization, which was focused mostly on a GMM-based online diarization system and an i-vector based system with both offline and online variants. The final section also details a newly proposed approach for detecting overlapping speech using a convolutional neural network
Speech Recognition
Chapters in the first part of the book cover all the essential speech processing techniques for building robust, automatic speech recognition systems: the representation for speech signals and the methods for speech-features extraction, acoustic and language modeling, efficient algorithms for searching the hypothesis space, and multimodal approaches to speech recognition. The last part of the book is devoted to other speech processing applications that can use the information from automatic speech recognition for speaker identification and tracking, for prosody modeling in emotion-detection systems and in other speech processing applications that are able to operate in real-world environments, like mobile communication services and smart homes