10 research outputs found

    IRRMA: An Image Recommender Robot Meeting Assistant

    The number of people attending virtual meetings has increased as a result of COVID-19. In this paper, we present a system consisting of an expressive humanoid social robot, QTRobot, and a recommender system that employs natural language processing techniques to recommend images related to the content of the presenter’s speech to the audience in real time. This is achieved using the QTRobot platform’s capabilities (microphone, computation power, and Wi-Fi).
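
    The abstract does not detail the recommender's internals, so the following is only a hedged sketch of the loop it describes: take a chunk of transcribed presenter speech, extract salient keywords, and look up matching images. The transcript, keyword extractor, and in-memory image index are all assumptions for illustration.

```python
# Hypothetical sketch of the speech-to-image-recommendation loop.
# Transcription is assumed already done (e.g. by the robot's microphone + ASR);
# the image index here is a toy in-memory stand-in.
from sklearn.feature_extraction.text import TfidfVectorizer


def top_keywords(chunk: str, history: list[str], k: int = 8) -> list[str]:
    """Rank words in the latest speech chunk by TF-IDF against earlier chunks."""
    vectorizer = TfidfVectorizer(stop_words="english")
    tfidf = vectorizer.fit_transform(history + [chunk])
    scores = tfidf[len(history)].toarray().ravel()   # row for the latest chunk
    vocab = vectorizer.get_feature_names_out()
    ranked = scores.argsort()[::-1]
    return [vocab[i] for i in ranked[:k] if scores[i] > 0]


def recommend_images(keywords: list[str], index: dict[str, list[str]]) -> list[str]:
    """Look up images tagged with any extracted keyword."""
    return [img for kw in keywords for img in index.get(kw, [])]


# Example: one chunk of presenter speech against two earlier chunks.
history = ["welcome everyone to the quarterly review",
           "first let us look at the project timeline"]
chunk = "the new solar panel prototype exceeded our efficiency targets"
index = {"solar": ["img/solar_panel.jpg"], "efficiency": ["img/efficiency_chart.png"]}
keywords = top_keywords(chunk, history)
print(keywords, "->", recommend_images(keywords, index))
```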

    The Prosody of Cheering in Sport Events

    Motivational speaking usually conveys a highly emotional message and its purpose is to invite action. The goal of this paper is to investigate the prosodic realization of one particular type of cheering, namely inciting cheering for single addressees in sport events (here, long-distance running), using the name of that person. 31 native speakers of German took part in the experiment. They were asked to cheer on an individual marathon runner in a sporting event, presented on video, by producing his or her name (1-5 syllables long). For comparison, the participants also produced the same names in isolation and in carrier sentences. Our results reveal that speakers use different strategies to meet their motivational communicative goals: while some speakers produced the runners’ names by dividing them into syllables, others pronounced the names as quickly as possible, putting more emphasis on the first syllable. A few speakers followed a mixed strategy. Contrary to our expectations, it was not intensity that contributed most to the differences between the speaking styles (cheering vs. neutral), at least with the methods we used. Rather, participants employed higher fundamental frequency and longer duration when cheering for marathon runners.
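
    The paper does not state its measurement tooling, but the three measures it compares (fundamental frequency, duration, intensity) can be extracted per token roughly as below, using the parselmouth interface to Praat. The file name is a placeholder.

```python
# Minimal sketch of the acoustic measurements discussed above: F0, duration,
# and intensity of one cheered name token. "cheer_name_01.wav" is hypothetical.
import numpy as np
import parselmouth

snd = parselmouth.Sound("cheer_name_01.wav")

# Duration of the token (seconds).
duration = snd.xmax - snd.xmin

# Fundamental frequency: mean over voiced frames only (unvoiced frames are 0).
pitch = snd.to_pitch()
f0 = pitch.selected_array["frequency"]
mean_f0 = np.mean(f0[f0 > 0])

# Mean intensity (dB) over the token.
intensity = snd.to_intensity()
mean_db = np.mean(intensity.values)

print(f"duration={duration:.2f}s  mean F0={mean_f0:.1f}Hz  intensity={mean_db:.1f}dB")
```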

    LUX-ASR: Building an ASR system for the Luxembourgish language

    We present a first system for automatic speech recognition (ASR) for the low-resource language Luxembourgish. By applying transfer learning, we were able to fine-tune Meta’s wav2vec2-xls-r-300m checkpoint with 35 hours of labeled Luxembourgish speech data. The best word error rate achieved was 14.47.
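
    A hedged sketch of this transfer-learning setup: load the multilingual wav2vec2-xls-r-300m checkpoint from the Hugging Face hub and attach a fresh CTC head sized to a (hypothetical) Luxembourgish character vocabulary. The data pipeline and the full CTC fine-tuning loop are omitted; the vocabulary size is an assumption.

```python
# Sketch only: the usual wav2vec 2.0 CTC fine-tuning recipe, minus data handling.
import torch
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2ForCTC

VOCAB_SIZE = 40  # assumption: size of a Luxembourgish character vocabulary

feature_extractor = Wav2Vec2FeatureExtractor.from_pretrained(
    "facebook/wav2vec2-xls-r-300m")
model = Wav2Vec2ForCTC.from_pretrained(
    "facebook/wav2vec2-xls-r-300m",
    vocab_size=VOCAB_SIZE,          # new, randomly initialised CTC head
    ctc_loss_reduction="mean",
)
model.freeze_feature_encoder()      # common recipe: keep the conv encoder fixed

# Dummy forward pass: one second of 16 kHz audio.
audio = torch.randn(16000).numpy()
inputs = feature_extractor(audio, sampling_rate=16000, return_tensors="pt")
logits = model(**inputs).logits     # shape: (batch, time, VOCAB_SIZE)
print(logits.shape)
```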

    XAI: Using Smart Photobooth for Explaining History of Art

    The rise of Artificial Intelligence has led to advancements in daily life, including applications in industry, telemedicine, farming, and smart cities. Despite AI’s success in "less technical" fields, human-AI synergies are necessary to guarantee user engagement and to provide interactive expert knowledge. In this article, we discuss possible synergies between humans and AI to explain the development of art history and artistic style transfer. This study is part of the "Smart Photobooth" project, which automatically transforms a user’s picture into a well-known artistic style, as an interactive approach to introduce the fundamentals of the history of art to a general audience and provide a concise explanation of the various painting styles. The study investigates human-AI synergies by combining the explanation produced by an explainable AI mechanism with a human expert’s insights to provide explanations for school students and a larger audience.
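
    The abstract does not name the style-transfer model the photobooth uses; as an illustration only, the transformation step could look like the sketch below, which uses the publicly available Magenta arbitrary-image-stylization model on TensorFlow Hub. Both image file names are placeholders.

```python
# Hypothetical stand-in for the photobooth's style-transfer component.
import tensorflow as tf
import tensorflow_hub as hub


def load_image(path: str) -> tf.Tensor:
    """Decode an image to float32 in [0, 1] and add a batch dimension."""
    img = tf.image.decode_image(tf.io.read_file(path), channels=3, dtype=tf.float32)
    return img[tf.newaxis, ...]


stylize = hub.load(
    "https://tfhub.dev/google/magenta/arbitrary-image-stylization-v1-256/2")
content = load_image("user_photo.jpg")                               # placeholder
style = tf.image.resize(load_image("van_gogh_starry_night.jpg"), (256, 256))
stylized = stylize(tf.constant(content), tf.constant(style))[0]      # batched output
tf.keras.utils.save_img("stylized.png", stylized[0])
```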

    A comparative study of automatic classifiers to recognize speakers based on fricatives

    Speakers’ voices are highly individual, and for this reason speakers can be identified by their voice. Nevertheless, voices are often more variable within the same speaker than between speakers, which makes it difficult for humans and machines to differentiate between speakers (Hansen & Hasan, 2015). To date, various machine learning methods have been developed to recognize speakers based on the acoustic characteristics of their speech; however, not all of them have proven equally effective in speaker identification, and the results depend on the technique used. Here, different machine learning classifiers (Naïve Bayes (NB), support vector machines (SVM), random forests (RF), and k-nearest neighbors (KNN)) were applied to identify the best model for recognizing speakers based on voiceless fricatives, using the acoustic features of center of gravity, standard deviation, and skewness, across 4 speaking styles. We used a dataset consisting of speech samples from 7 native Persian subjects speaking in 4 different styles: read, spontaneous, clear, and child-directed speech. The results revealed that the best-performing model for predicting speakers from these segments was the RF model, with an accuracy of 81.3%, followed by SVM (76.3%), NB (75.4%), and KNN (74%) (Table 1). The RF performed best for the voiceless fricatives /f/, /s/, and /ʃ/, which may indicate that these segments are more speaker-specific than others (Gordon et al., 2002), while performance was low for the voiceless fricatives /h/ and /x/. This can be seen in the confusion matrix (Figure 1), with high precision and recall values (above 80%) for /f/, /s/, and /ʃ/ (Table 2). We also found that model performance improved for the clear speaking style; speaker-specific information in voiceless fricatives is more distinguishable in clear speech than in the other styles (Table 1).
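
    A minimal sketch of this comparison, under stated assumptions: the three spectral moments named above are computed from each fricative segment, then the four scikit-learn classifiers are compared by cross-validation. Real segmentation and labels are assumed done; random noise segments stand in for the Persian fricative tokens, so the printed accuracies will sit at chance.

```python
# Sketch: spectral moments of fricative segments -> NB/SVM/RF/KNN comparison.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC


def spectral_moments(segment: np.ndarray, sr: int) -> np.ndarray:
    """Centre of gravity, standard deviation, skewness of the magnitude spectrum."""
    spectrum = np.abs(np.fft.rfft(segment))
    freqs = np.fft.rfftfreq(len(segment), d=1.0 / sr)
    p = spectrum / spectrum.sum()                      # normalise to a distribution
    cog = np.sum(freqs * p)                            # 1st moment
    sd = np.sqrt(np.sum(((freqs - cog) ** 2) * p))     # 2nd moment
    skew = np.sum(((freqs - cog) ** 3) * p) / sd**3    # standardised 3rd moment
    return np.array([cog, sd, skew])


# Placeholder tokens: 280 random-noise "fricative" segments, 7 speaker labels.
rng = np.random.default_rng(0)
sr = 16000
segments = rng.normal(size=(280, 1024))
X = np.array([spectral_moments(s, sr) for s in segments])
y = rng.integers(0, 7, size=280)

models = {"NB": GaussianNB(), "SVM": SVC(), "RF": RandomForestClassifier(),
          "KNN": KNeighborsClassifier()}
for name, clf in models.items():
    acc = cross_val_score(clf, X, y, cv=5).mean()
    print(f"{name}: {acc:.3f}")
```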

    ASRLUX: AUTOMATIC SPEECH RECOGNITION FOR THE LOW-RESOURCE LANGUAGE LUXEMBOURGISH

    We have developed an automatic speech recognition (ASR) system tailored to Luxembourgish, a low-resource language that poses distinct challenges for conventional ASR approaches due to the limited availability of training data and its inherently multilingual setting. Employing transfer learning, we fine-tuned an array of models derived from pre-trained wav2vec 2.0 and Whisper checkpoints. These models had been trained on extensive corpora covering many languages and several hundred thousand hours of audio, using unsupervised and weakly supervised methods, respectively. The corpora include linguistically related languages such as German, Dutch, and French, which expedites cross-lingual transfer to Luxembourgish-specific models. Fine-tuning was carried out on 67 hours of annotated Luxembourgish speech data from a diverse range of speakers. The best word error rates (WER) achieved with the wav2vec 2.0 and Whisper models were 9.5 and 12.1, respectively. These low WERs substantiate the efficacy of transfer learning for ASR in low-resource languages.
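
    The paper does not state its evaluation tooling, but the WER figures quoted above are the standard metric and can be reproduced on any hypothesis/reference pairs with the jiwer package, as in the sketch below. The Luxembourgish sentences are invented placeholders, not corpus data.

```python
# Word error rate between reference transcripts and ASR hypotheses.
import jiwer

references = ["moien wei geet et dir", "ech schwatzen letzebuergesch"]
hypotheses = ["moien wei geet der",    "ech schwatze letzebuergesch"]

wer = jiwer.wer(references, hypotheses)
print(f"WER: {wer:.3f}")  # fraction of word substitutions/insertions/deletions
```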

    User Requirement Analysis for a Real-Time NLP-Based Open Information Retrieval Meeting Assistant

    Meetings are recurrent organizational tasks intended to drive progress in an interdisciplinary and collaborative manner. They are, however, prone to inefficiency due to factors such as differing knowledge among participants. The research goal of this paper is to design a recommendation-based meeting assistant that can improve the efficiency of meetings by helping to contextualize the information being discussed and by reducing distractions for listeners. Following a Wizard-of-Oz setup, we gathered user feedback by thematically analyzing focus group discussions, identifying the key challenges and requirements for this kind of system. The findings point to shortcomings in contextualization and raise concerns about distracting listeners from the main content. Based on these findings, we have developed a set of design recommendations that address context, interactivity, and personalization issues. These recommendations could be useful for developing a meeting assistant tailored to the needs of meeting participants, thereby helping to optimize the meeting experience.

    Experiments of ASR-based mispronunciation detection for children and adult English learners

    Pronunciation is one of the fundamentals of language learning, and it is a primary factor in understanding and being understood in spoken language. The persistently high error rates that mispronunciations cause in speech recognition motivate us to find alternative techniques for handling them. In this study, we develop a mispronunciation assessment system that checks the pronunciation of non-native English speakers, identifies the phonemes commonly mispronounced by Italian learners of English, and presents an evaluation of the non-native pronunciation observed in phonetically annotated speech corpora. To detect mispronunciations, we used a phone-based ASR system implemented with Kaldi. We used two labeled non-native English corpora: (i) a corpus of Italian adults containing 5,867 utterances from 46 speakers, and (ii) a corpus of Italian children consisting of 5,268 utterances from 78 children. Our results show that the selected error model can discriminate correct sounds from incorrect sounds in both native and non-native speech and can therefore be used to detect pronunciation errors in non-native speech. The phone error rates improve when the error language model is used, and the ASR system shows better accuracy after applying the error model to our selected corpora.
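
    A hedged sketch of the phone-level comparison underlying this kind of mispronunciation detection: align the canonical phone sequence against the phones the ASR recognised and count mismatches. The alignment below is plain Levenshtein edit distance; the paper's Kaldi pipeline and error model are not shown.

```python
# Phone error rate via edit distance between canonical and recognised phones.
def phone_edit_distance(canonical: list[str], recognised: list[str]) -> int:
    n, m = len(canonical), len(recognised)
    # dp[i][j] = edit distance between canonical[:i] and recognised[:j]
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dp[i][0] = i
    for j in range(m + 1):
        dp[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if canonical[i - 1] == recognised[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # match / substitution
    return dp[n][m]


# Example: a learner says "sink" /s ih ng k/ for "think" /th ih ng k/.
canonical = ["th", "ih", "ng", "k"]
recognised = ["s", "ih", "ng", "k"]
errors = phone_edit_distance(canonical, recognised)
print(f"phone error rate: {errors / len(canonical):.2f}")  # 1 sub / 4 = 0.25
```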

    EXPLORING THE USE OF PHONOLOGICAL FEATURES FOR PARKINSON’S DISEASE DETECTION

    Parkinson’s disease (PD) is a neurodegenerative disorder that causes motor and non-motor symptoms. Speech impairments are among the early symptoms of PD, but they are not always fully exploited by clinicians. In this study, phonological features extracted from speech data collected from Spanish-speaking patients were explored to distinguish PD patients from healthy controls (HCs), using phonet, which was trained on Spanish data, and PhonVoc, which was trained on English data. These features were then used to train and test several machine learning models. The XGBoost model achieved the best performance in classifying patients versus HCs, with an accuracy of over 0.76. The model performed better when using the phonological model trained on Spanish data rather than the one trained on English data.
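
    A minimal sketch of the classification stage described above, under stated assumptions: extraction of the phonological features with phonet/PhonVoc is treated as already done, and a random placeholder matrix stands in for the real per-speaker feature vectors, so the printed accuracy is meaningless except as a demonstration of the pipeline.

```python
# XGBoost classifier on (placeholder) phonological features: PD vs. healthy control.
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

rng = np.random.default_rng(42)
X = rng.normal(size=(100, 18))      # e.g. 18 phonological-class posteriors (assumed)
y = rng.integers(0, 2, size=100)    # 1 = PD patient, 0 = healthy control

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
clf = XGBClassifier(n_estimators=200, max_depth=3, eval_metric="logloss")
clf.fit(X_tr, y_tr)
print(f"accuracy: {accuracy_score(y_te, clf.predict(X_te)):.2f}")
```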