11 research outputs found

    The 6th Digital Humanities in the Nordic and Baltic Countries Conference (DHNB 2022), Uppsala, Sweden, March 15-18, 2022

    This paper traces signs of urban culture in Finnish fiction films from the 1950s by drawing on a multimodal analysis of audiovisual content. The Finnish National Filmography includes 208 feature films released between 1950 and 1959. Our approach to the automatic analysis of media content includes aural and visual object recognition and speech recognition. We concentrate on features that epitomize urbanity, including visual objects, such as forms of transportation (cars, horses), and sounds (rural and urban sounds, speech). Based on the scores and frequencies of these recognitions, we observe quantitative changes that took place during the 1950s. The paper demonstrates that aural and visual object recognition, as well as speech recognition, can successfully be applied in film-historical analysis. The overall results support the idea that Finnish filmmakers fueled the imagination of urban life in the 1950s, paving the way for modern technologies and gradually pushing aside the signs of rural life.
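    The visual recognition step can be illustrated with a short sketch. The paper does not name its detectors, so the COCO-pretrained model, the sampling interval, and the marker classes below are assumptions, not the authors' pipeline:

```python
# A sketch of the visual object recognition step: count "urban" and
# "rural" markers (cars, horses) in sampled frames of one film.
# The detector, sampling interval, and threshold are assumptions.
import cv2
import torch
from torchvision.models.detection import (
    fasterrcnn_resnet50_fpn,
    FasterRCNN_ResNet50_FPN_Weights,
)

weights = FasterRCNN_ResNet50_FPN_Weights.DEFAULT
model = fasterrcnn_resnet50_fpn(weights=weights).eval()
labels = weights.meta["categories"]  # COCO class names, include "car" and "horse"
targets = {"car", "horse"}

def count_markers(video_path, every_n=250, threshold=0.7):
    """Run the detector on every n-th frame and count confident hits."""
    counts = {name: 0 for name in targets}
    cap = cv2.VideoCapture(video_path)
    idx = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % every_n == 0:
            rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
            tensor = torch.from_numpy(rgb).permute(2, 0, 1).float() / 255.0
            with torch.no_grad():
                det = model([tensor])[0]
            for label, score in zip(det["labels"], det["scores"]):
                name = labels[int(label)]
                if name in targets and float(score) >= threshold:
                    counts[name] += 1
        idx += 1
    cap.release()
    return counts
```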

    Binaural hearing in rooms


    Automaattinen puheentunnistus kuuroille ja huonokuuloisille lisätyn todellisuuden sovelluksessa (Automatic speech recognition for the deaf and hard of hearing in an augmented reality application)

    People with hearing loss experience considerable difficulties in participating in and understanding spoken communication, which negatively affects many aspects of their lives. In many proposed solutions, the deaf or hard-of-hearing person has to shift their attention away from the speaker and consequently misses, for instance, the speaker's gestures and facial expressions. This thesis studied the use of augmented reality and automatic speech recognition technologies in an assistive mobile application for the hearing impaired. The application uses mobile augmented reality with video-based augmentations, and speech recognition is performed with modern neural network models. In the implementation, automatic speech recogniser transcriptions were placed in speech bubbles on top of an augmented reality view of the conversation partner. This minimised the distance between the speaker and the transcriptions, which helps the hearing impaired follow the conversation. To validate the usefulness of the approach, user tests were organised with hearing-impaired participants. The results show that the deaf and hard of hearing found the augmented reality view and the application helpful for following conversations. The improvements most requested by the test users were visual separation and identification of speakers in group conversations, and higher speech recognition accuracy.
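    As a rough illustration of the speech-bubble idea, consider the minimal sketch below. It is not the thesis implementation: the AR toolkit and neural ASR backend are not reproduced, and get_latest_transcription is a hypothetical stand-in for a streaming recogniser client.

```python
# A sketch of the speech-bubble overlay: anchor the latest transcription
# near the detected face so gestures and lip movements stay in view.
# get_latest_transcription() is a hypothetical stand-in for a streaming
# ASR client; the thesis app itself is not reproduced here.
import cv2

face_cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml"
)

def get_latest_transcription():
    # Placeholder: a real app would stream results from a recogniser.
    return "hello, how are you?"

def draw_bubble(frame, text):
    """Draw a white caption box just above the first detected face."""
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    faces = face_cascade.detectMultiScale(gray, 1.3, 5)
    if len(faces) > 0:
        x, y, w, h = faces[0]
        cv2.rectangle(frame, (x, y - 40), (x + w, y - 10), (255, 255, 255), -1)
        cv2.putText(frame, text, (x + 5, y - 18),
                    cv2.FONT_HERSHEY_SIMPLEX, 0.6, (0, 0, 0), 1)
    return frame

cap = cv2.VideoCapture(0)  # camera pointed at the conversation partner
while cap.isOpened():
    ok, frame = cap.read()
    if not ok:
        break
    cv2.imshow("AR captions", draw_bubble(frame, get_latest_transcription()))
    if cv2.waitKey(1) & 0xFF == ord("q"):
        break
cap.release()
cv2.destroyAllWindows()
```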

    A user study to compare two conversational assistants designed for people with hearing impairments

    Participating in conversations can be difficult for people with hearing loss, especially in acoustically challenging environments. We studied the preferences of the hearing impaired for a personal conversation assistant based on automatic speech recognition (ASR) technology. We created two prototypes, which were evaluated by hearing-impaired test users; this paper qualitatively compares the two based on the feedback obtained from the tests. The first prototype was a proof-of-concept system running real-time ASR on a laptop. The second was developed for a mobile device, with the recognizer running on a separate server. On the mobile device, augmented reality (AR) was used to help the hearing impaired observe the gestures and lip movements of the speaker simultaneously with the transcriptions. Several testers found the systems useful enough to use in their daily lives, with the majority preferring the mobile AR version. The testers' biggest concerns were the accuracy of the transcriptions and the lack of speaker identification.
    Peer reviewed

    Low Resource Comparison of Attention-based and Hybrid ASR Exploiting wav2vec 2.0

    Funding Information: We are grateful for the Academy of Finland project funding, numbers 337073 and 345790. We acknowledge the computational resources provided by the Aalto Science-IT project. Publisher Copyright: Copyright © 2022 ISCA.
    Low-resource speech recognition can potentially benefit greatly from exploiting a pretrained model such as wav2vec 2.0. These pretrained models have learned useful representations in an unsupervised or self-supervised task, often leveraging a very large corpus of untranscribed speech. The pretrained models can then be used in various ways. In this work we compare two approaches that exploit wav2vec 2.0: an attention-based encoder-decoder (AED) model, where the wav2vec 2.0 model is used in the encoder, and a hybrid hidden Markov model (HMM/DNN) speech recognition system, where the wav2vec 2.0 model is used in the acoustic model. These approaches are compared on a very difficult Northern Sámi task, as well as an easier, simulated low-resource task in Finnish. We find that the wav2vec 2.0 AED models can learn a working attention mechanism, but are still outperformed by wav2vec 2.0 HMM/DNN systems. Our best wav2vec 2.0 HMM/DNN recipe trained on 20 hours is competitive with an HMM/DNN system trained on 1600 hours.
    Peer reviewed
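    A minimal sketch of the two ways a wav2vec 2.0 model can be plugged in, assuming the Hugging Face transformers checkpoint facebook/wav2vec2-base; the decoder, state inventory, and sizes are illustrative, not the paper's recipes:

```python
# A sketch of the two ways to exploit a pretrained wav2vec 2.0 model:
# (1) as the acoustic model of a hybrid HMM/DNN system, via a frame-level
# classifier over HMM states, and (2) as the encoder of an AED model.
# Checkpoint, state inventory, and decoder sizes are illustrative.
import torch
from transformers import Wav2Vec2Model

encoder = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base")
audio = torch.randn(1, 16000)             # one second of 16 kHz audio (dummy)
feats = encoder(audio).last_hidden_state  # (batch, frames, 768)

# (1) Hybrid use: per-frame scores over an assumed inventory of HMM states.
n_states = 2000
am_head = torch.nn.Linear(768, n_states)
state_logits = am_head(feats)             # fed to HMM decoding in a real system

# (2) AED use: an attention decoder cross-attends to the same features.
layer = torch.nn.TransformerDecoderLayer(d_model=768, nhead=8, batch_first=True)
decoder = torch.nn.TransformerDecoder(layer, num_layers=2)
token_embs = torch.randn(1, 10, 768)      # embedded output tokens (dummy)
dec_out = decoder(tgt=token_embs, memory=feats)
```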

    Finnish parliament ASR corpus

    Funding Information: This work has been supported by the MeMAD project of the European Union's Horizon 2020 research and innovation programme (grant agreement no. 780069) and by Academy of Finland grants 329267, 337073, and 345790. We also thank Aalto Science-IT for providing computational resources. Publisher Copyright: © 2023, The Author(s). | openaire: EC/H2020/780069/EU//MeMAD
    Public sources like parliament meeting recordings and transcripts provide ever-growing material for the training and evaluation of automatic speech recognition (ASR) systems. In this paper, we publish and analyse the Finnish Parliament ASR Corpus, the most extensive publicly available collection of manually transcribed Finnish speech data, with over 3000 hours of speech from 449 speakers, for whom rich demographic metadata is provided. This corpus builds on earlier initial work, and as a result it has a natural split into two training subsets from two periods of time. Similarly, there are two official, corrected test sets covering different times, setting up an ASR task with longitudinal distribution-shift characteristics. An official development set is also provided. We developed a complete Kaldi-based data-preparation pipeline and ASR recipes for hidden Markov models (HMM), hybrid deep neural networks (HMM-DNN), and attention-based encoder-decoders (AED). For the HMM-DNN systems, we provide results with time-delay neural networks (TDNN) as well as state-of-the-art wav2vec 2.0 pretrained acoustic models. We set benchmarks on the official test sets and on multiple other recently used test sets. Both temporal corpus subsets are already large, and we observe that, beyond their scale, HMM-TDNN ASR performance on the official test sets has reached a plateau. In contrast, other domains and larger wav2vec 2.0 models benefit from added data. The HMM-DNN and AED approaches are compared in a carefully matched equal-data setting, with the HMM-DNN system consistently performing better. Finally, variation in ASR accuracy is compared between the speaker categories available in the parliament metadata to detect potential biases based on factors such as gender, age, and education.
    Peer reviewed
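    The bias analysis at the end can be illustrated by grouping recognition results by a metadata field and computing word error rate (WER) per group. Below is a minimal sketch using the jiwer package; the category values and example utterances are invented, not corpus data:

```python
# A sketch of the per-category accuracy comparison: group recognition
# results by a metadata field and compute WER per group with jiwer.
# Field values and utterances below are invented, not corpus data.
from collections import defaultdict
import jiwer

# (reference, hypothesis, speaker category) triples
results = [
    ("hyvät kollegat", "hyvät kollegat", "female"),
    ("arvoisa puhemies", "arvoisa puhe mies", "male"),
    ("kiitos puheenvuorosta", "kiitos puheenvuorosta", "male"),
]

groups = defaultdict(lambda: ([], []))
for ref, hyp, category in results:
    groups[category][0].append(ref)
    groups[category][1].append(hyp)

for category, (refs, hyps) in sorted(groups.items()):
    print(category, jiwer.wer(refs, hyps))
```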

    Discovering Relevant Sub-spaces of BERT, Wav2Vec 2.0, ELECTRA and ViT Embeddings for Humor and Mimicked Emotion Recognition with Integrated Gradients

    Large-scale pre-trained models have revolutionized the field of sentiment analysis and enabled multimodal systems to be developed quickly. In this paper, we address two challenges posed by the Multimodal Sentiment Analysis (MuSe) 2023 competition, focusing on automatically detecting cross-cultural humor and predicting three continuous emotion targets from user-generated videos. Multiple methods in the literature already demonstrate the importance of embedded features generated by popular pre-trained neural solutions. Based on their success, we can assume that the embedding space consists of several sub-spaces relevant to different tasks. Our aim is to automatically identify the task-specific sub-spaces of various embeddings by interpreting the baseline neural models. Once the relevant dimensions are located, we train a new model using only those features, which leads to similar or slightly better results with a considerably smaller and faster model. The best humor-detection model using only the relevant sub-space of audio embeddings contained approximately 54% fewer parameters than the one processing the whole encoded vector, required 48% less training time, and even outperformed the larger model. Our empirical results validate that, indeed, only a portion of the embedding space is needed to achieve good performance. Our solution can be considered a novel form of knowledge distillation, enabling new ways of transferring knowledge from one model to another.
    Peer reviewed
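    A minimal sketch of the sub-space selection idea, assuming PyTorch and the Captum implementation of Integrated Gradients; the classifier, embedding size, and choice of k are toy values, not the paper's setup:

```python
# A sketch of locating a task-specific sub-space with Integrated Gradients
# (Captum) and retraining a smaller model on it. The classifier, embedding
# size, and the choice of k are toy values, not the paper's setup.
import torch
from captum.attr import IntegratedGradients

emb_dim, n_classes, k = 768, 2, 350
baseline_model = torch.nn.Sequential(
    torch.nn.Linear(emb_dim, 128), torch.nn.ReLU(), torch.nn.Linear(128, n_classes)
)

embeddings = torch.randn(64, emb_dim)      # pretrained features (dummy)
ig = IntegratedGradients(baseline_model)
attr = ig.attribute(embeddings, target=1)  # attribution per embedding dimension
relevance = attr.abs().mean(dim=0)         # average importance of each dimension
top_dims = relevance.topk(k).indices       # the task-specific sub-space

# The smaller model sees only the selected dimensions.
small_model = torch.nn.Sequential(
    torch.nn.Linear(k, 128), torch.nn.ReLU(), torch.nn.Linear(128, n_classes)
)
logits = small_model(embeddings[:, top_dims])
```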

    Tracing Signs of Urbanity in the Finnish Fiction Film of the 1950s: Toward a Multimodal Analysis of Audiovisual Data

    Funding Information: This work was supported by the research consortium Movie Making Finland: Finnish fiction films as audiovisual big data, 1907–2017 (MoMaF), funded by the Academy of Finland (329266). Film data and metadata were provided by the National Audiovisual Institute in Finland and computational resources by CSC – IT Center for Science, Espoo, Finland. Publisher Copyright: © 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
    This paper traces signs of urban culture in Finnish fiction films from the 1950s by drawing on a multimodal analysis of audiovisual content. The Finnish National Filmography includes 208 feature films released between 1950 and 1959. Our approach to the automatic analysis of media content includes aural and visual object recognition and speech recognition. We concentrate on features that epitomize urbanity, including visual objects, such as forms of transportation (cars, horses), and sounds (rural and urban sounds, speech). Based on the scores and frequencies of these recognitions, we observe quantitative changes that took place during the 1950s. The paper demonstrates that aural and visual object recognition, as well as speech recognition, can successfully be applied in film-historical analysis. The overall results support the idea that Finnish filmmakers fueled the imagination of urban life in the 1950s, paving the way for modern technologies and gradually pushing aside the signs of rural life.
    Peer reviewed
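    The aggregation behind the reported quantitative changes can be sketched as averaging per-film detection rates by release year; the numbers below are invented, not the paper's data:

```python
# A sketch of the trend analysis: average a marker's detection rate
# (e.g. cars per hour of film) by release year. Numbers are invented.
from collections import defaultdict

films = [(1950, 3.1), (1950, 2.4), (1955, 6.8), (1959, 9.2)]

by_year = defaultdict(list)
for year, rate in films:
    by_year[year].append(rate)

for year in sorted(by_year):
    rates = by_year[year]
    print(year, sum(rates) / len(rates))  # mean detection rate for that year
```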