10 research outputs found
IRRMA: An Image Recommender Robot Meeting Assistant
The number of people attending virtual meetings has increased as a result of COVID-19. In this paper, we present a system consisting of an expressive humanoid social robot, QTRobot, and a recommender system that employs natural language processing techniques to recommend images related to the content of the presenter's speech to the audience in real time. This is achieved using the QTRobot platform's capabilities (microphone, computation power, and Wi-Fi).
The Prosody of Cheering in Sport Events
Motivational speaking usually conveys a highly emotional message, and its purpose is to incite action. The goal of this paper is to investigate the prosodic realization of one particular type of cheering, namely inciting cheering for a single addressee at sport events (here, long-distance running), using the name of that person. 31 native speakers of German took part in the experiment. They were asked to cheer on an individual marathon runner, presented on video, by producing his or her name (1-5 syllables long). For comparison, the participants also produced the same names in isolation and in carrier sentences. Our results reveal that speakers use different strategies to meet their motivational communicative goals: while some speakers produced the runners' names by dividing them into syllables, others pronounced the names as quickly as possible, putting more emphasis on the first syllable. A few speakers followed a mixed strategy. Contrary to our expectations, it was not intensity that contributed most to the differences between the speaking styles (cheering vs. neutral), at least with the methods we were using. Rather, participants employed higher fundamental frequency and longer duration when cheering for marathon runners.
LUX-ASR: Building an ASR system for the Luxembourgish language
We present a first system for automatic speech recognition (ASR) for the low-resource language Luxembourgish. By applying transfer learning, we were able to fine-tune Meta's wav2vec2-xls-r-300m checkpoint with 35 hours of labeled Luxembourgish speech data. The best word error rate achieved is 14.47.
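The word error rate reported above is the standard edit-distance metric for ASR evaluation. As a hedged illustration of how such a number is computed (the Luxembourgish example sentences below are invented for the sketch, not from the paper's data):

```python
# Sketch of word error rate (WER) computation via Levenshtein distance
# over word sequences, the metric quoted in the abstract above.

def wer(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming edit-distance table.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution / match
    return d[len(ref)][len(hyp)] / len(ref)

# One substituted word out of five reference words -> WER 0.2
print(wer("ech schwätzen e bësse Lëtzebuergesch",
          "ech schwätze e bësse Lëtzebuergesch"))  # → 0.2
```

In practice, corpus-level WER sums edit operations over all utterances before dividing by the total reference word count, rather than averaging per-sentence rates.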
XAI: Using Smart Photobooth for Explaining History of Art
The rise of Artificial Intelligence has led to advancements in daily life, including applications in industry, telemedicine, farming, and smart cities. Despite AI's success in "less technical" fields, human-AI synergies are necessary to guarantee user engagement and to provide interactive expert knowledge. In this article, we discuss possible synergies between humans and AI for explaining the development of art history and artistic style transfer. This study is part of the "Smart Photobooth" project, which automatically transforms a user's picture into a well-known artistic style as an interactive approach to introducing the fundamentals of art history to the general public and providing a concise explanation of the various painting styles. This study investigates human-AI synergies by combining the explanation produced by an explainable AI mechanism with a human expert's insights to provide explanations for school students and a larger audience.
The Magic Number: Impact of Sample Size for Dementia Screening Using Transfer Learning and Data Augmentation of Clock Drawing Test Images
Peer reviewed.
A comparative study of automatic classifiers to recognize speakers based on fricatives
Speakers' voices are highly individual, and for this reason speakers can be identified by their voice. Nevertheless, voices are often more variable within the same speaker than they are between speakers, which makes it difficult for humans and machines to differentiate between speakers (Hansen & Hasan, 2015). To date, various machine learning methods have been developed to recognize speakers based on the acoustic characteristics of their speech; however, not all of them have proven equally effective in speaker identification, and the system achieves different results depending on the technique used. Here, different machine learning classifiers (Naïve Bayes (NB), support vector machines (SVM), random forests (RF), and k-nearest neighbors (KNN)) were applied to identify the best classification model for categorizing speakers across 4 speaking styles based on segment type (voiceless fricatives), considering the acoustic features center of gravity, standard deviation, and skewness. We used a dataset consisting of speech samples from 7 native Persian subjects speaking in 4 different speaking styles: read, spontaneous, clear, and child-directed speech. The results revealed that the best-performing model for predicting the speakers based on segment type was the RF model, with an accuracy of 81.3%, followed by SVM (76.3%), NB (75.4%), and KNN (74%) (Table 1). Our results showed that the RF performed best for the voiceless fricatives /f/, /s/, and /ʃ/, which may indicate that these segments are much more speaker-specific than others (Gordon et al., 2002), while model performance was low for the voiceless fricatives /h/ and /x/. Performance can be seen in the confusion matrix (Figure 1), which showed high precision and recall values (above 80%) for /f/, /s/, and /ʃ/ (Table 2). We found that model performance improved for the clear speaking style; the speaker-specific information in voiceless fricatives is more distinguishable in clear speech than in the other styles (Table 1).
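The four-classifier comparison described above can be sketched with scikit-learn. This is a hedged illustration only: the feature matrix below is synthetic stand-in data, whereas the study uses real spectral moments (center of gravity, standard deviation, skewness) measured from voiceless fricatives of 7 Persian speakers.

```python
# Sketch of the NB / SVM / RF / KNN comparison for speaker identification.
# Synthetic data: each "speaker" gets a distinct mean for 3 stand-in
# spectral-moment features (center of gravity, SD, skewness).
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n_speakers, per_speaker = 7, 40
X = np.vstack([rng.normal(loc=i, scale=0.8, size=(per_speaker, 3))
               for i in range(n_speakers)])
y = np.repeat(np.arange(n_speakers), per_speaker)

models = {
    "NB": GaussianNB(),
    "SVM": SVC(),
    "RF": RandomForestClassifier(n_estimators=100, random_state=0),
    "KNN": KNeighborsClassifier(n_neighbors=5),
}
# 5-fold cross-validated accuracy per classifier.
for name, model in models.items():
    acc = cross_val_score(model, X, y, cv=5).mean()
    print(f"{name}: {acc:.3f}")
```

With real acoustic measurements, the relative ranking of the classifiers (RF best in the study) depends on the data; the synthetic scores here illustrate the protocol, not the paper's results.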
ASRLUX: AUTOMATIC SPEECH RECOGNITION FOR THE LOW-RESOURCE LANGUAGE LUXEMBOURGISH
We have developed an automatic speech recognition (ASR) system tailored to Luxembourgish, a low-resource language that poses distinct challenges for conventional ASR approaches due to the limited availability of training data and its inherent multilingual nature. By employing transfer learning, we meticulously fine-tuned an array of models derived from pre-trained wav2vec 2.0 and Whisper checkpoints. These models have been trained on an extensive corpus of various languages and several hundred thousand hours of audio data, utilizing unsupervised and weakly supervised methodologies, respectively. This includes linguistically related languages such as German, Dutch, and French, which expedite the cross-lingual training process for Luxembourgish-specific models. Fine-tuning was executed utilizing 67 hours of annotated Luxembourgish speech data sourced from a diverse range of speakers. The optimal word error rates (WER) achieved for the wav2vec 2.0 and Whisper models were 9.5 and 12.1, respectively. The remarkably low WERs obtained substantiate the efficacy of transfer learning in the context of ASR for low-resource languages.
User Requirement Analysis for a Real-Time NLP-Based Open Information Retrieval Meeting Assistant
Meetings are recurrent organizational tasks intended to drive progress in an interdisciplinary and collaborative manner. They are, however, prone to inefficiency due to factors such as differing knowledge among participants. The research goal of this paper is to design a recommendation-based meeting assistant that can improve the efficiency of meetings by helping to contextualize the information being discussed and by reducing distractions for listeners. Following a Wizard-of-Oz setup, we gathered user feedback by thematically analyzing focus group discussions and identifying the key challenges and requirements for this kind of system. The findings point to shortcomings in contextualization and raise concerns about distracting listeners from the main content. Based on the findings, we have developed a set of design recommendations that address context, interactivity, and personalization issues. These recommendations could be useful for developing a meeting assistant that is tailored to the needs of meeting participants, thereby helping to optimize the meeting experience.
Experiments of ASR-based mispronunciation detection for children and adult English learners
Pronunciation is one of the fundamentals of language learning, and it is considered a primary factor in spoken language when it comes to understanding and being understood by others. The persistent presence of high error rates in speech recognition resulting from mispronunciations motivates us to find alternative techniques for handling mispronunciations. In this study, we develop a mispronunciation assessment system that checks the pronunciation of non-native English speakers, identifies the phonemes commonly mispronounced by Italian learners of English, and presents an evaluation of the non-native pronunciation observed in phonetically annotated speech corpora. In this work, to detect mispronunciations, we used a phone-based ASR implemented in Kaldi. We used two labeled non-native English corpora: (i) a corpus of Italian adults containing 5,867 utterances from 46 speakers, and (ii) a corpus of Italian children consisting of 5,268 utterances from 78 children. Our results show that the selected error model can discriminate correct sounds from incorrect sounds in both native and non-native speech, and can therefore be used to detect pronunciation errors in non-native speech. The phone error rates improve when the error language model is used. Furthermore, the ASR system shows better accuracy after applying the error model to our selected corpora.
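The core idea of phone-based mispronunciation detection (comparing the phones an ASR decodes against a canonical pronunciation) can be sketched with a simple sequence alignment. This is a hedged illustration of the general technique only; the paper's actual system is a Kaldi phone-based ASR with an error language model, and the phone labels below are invented for the example.

```python
# Sketch: align a canonical phone sequence against the phones decoded by
# an ASR and flag the mismatches as candidate mispronunciations.
from difflib import SequenceMatcher

def flag_mispronunciations(canonical, recognized):
    """Return (canonical_phone, recognized_phone) pairs that differ.

    None on either side marks a deleted or inserted phone.
    """
    errors = []
    sm = SequenceMatcher(a=canonical, b=recognized, autojunk=False)
    for tag, i1, i2, j1, j2 in sm.get_opcodes():
        if tag == "replace":      # substituted phone
            errors.extend(zip(canonical[i1:i2], recognized[j1:j2]))
        elif tag == "delete":     # phone missing from learner's speech
            errors.extend((p, None) for p in canonical[i1:i2])
        elif tag == "insert":     # extra phone produced by the learner
            errors.extend((None, p) for p in recognized[j1:j2])
    return errors

# e.g. a learner substituting /t/ for the "th" in "think"
print(flag_mispronunciations(["th", "ih", "ng", "k"],
                             ["t", "ih", "ng", "k"]))  # → [('th', 't')]
```

A real system would weight these alignment errors by the ASR's confidence and by an error model of which substitutions are plausible for the learner population, rather than treating every mismatch as a mispronunciation.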
EXPLORING THE USE OF PHONOLOGICAL FEATURES FOR PARKINSON’S DISEASE DETECTION
Parkinson's disease (PD) is a neurodegenerative disorder that causes motor and non-motor symptoms. Speech impairments are one of the early symptoms of PD, but they are not always fully exploited by clinicians. In this study, we explored the use of phonological features extracted from speech data collected from Spanish-speaking patients to distinguish PD patients from healthy controls (HCs), using phonet, which was trained on Spanish data, and PhonVoc, which was trained on English data. These features were then used to train and test several machine learning models. The XGBoost model achieved the best performance in classifying patients versus HCs, with an accuracy of over 0.76. However, the model performed better when using a phonological model trained on Spanish data rather than English data.
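The classification step described above can be sketched as follows. This is a hedged stand-in: the paper uses XGBoost on phonological features from phonet/PhonVoc, while the sketch below substitutes scikit-learn's GradientBoostingClassifier (a closely related gradient-boosting implementation) and uses synthetic features in place of real phonological-class posteriors.

```python
# Sketch of gradient-boosted classification of PD patients vs. healthy
# controls (HCs). Feature values are synthetic; real inputs would be
# phonological-class posteriors extracted from patients' speech.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
n, n_features = 200, 18  # e.g. one posterior per phonological class
X_pd = rng.normal(0.6, 0.2, size=(n // 2, n_features))  # patients
X_hc = rng.normal(0.4, 0.2, size=(n // 2, n_features))  # healthy controls
X = np.vstack([X_pd, X_hc])
y = np.array([1] * (n // 2) + [0] * (n // 2))

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y)
clf = GradientBoostingClassifier(random_state=0).fit(X_tr, y_tr)
print(f"accuracy: {accuracy_score(y_te, clf.predict(X_te)):.2f}")
```

With the clearly separated synthetic classes used here the accuracy is near ceiling; the study's reported accuracy of over 0.76 reflects the harder real-data task.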