
    Examining the contributions of automatic speech transcriptions and metadata sources for searching spontaneous conversational speech

    Searching spontaneous speech can be enhanced by combining automatic speech transcriptions with semantically related metadata. An important question is what can be expected, in terms of retrieval effectiveness, from search of such transcriptions and of different sources of related metadata. The Cross-Language Speech Retrieval (CL-SR) track at recent CLEF workshops provides a spontaneous speech test collection with manual and automatically derived metadata fields. Using this collection we investigate the comparative search effectiveness of the individual fields comprising automated transcriptions and the available metadata. A further important question is how transcriptions and metadata should be combined for the greatest benefit to search accuracy. We compare simple merging of individual fields with the extended BM25 model for weighted field combination (BM25F). Results indicate that BM25F can produce improved search accuracy, but that its parameters currently need to be set using a suitable training set.
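    As a rough illustration of the weighted field combination discussed above, the sketch below scores a fielded document (an ASR transcript field plus a metadata field) against a query with a simple BM25F variant. The field names, weights, parameter values and toy documents are illustrative assumptions, not those tuned in the paper.

```python
import math

def bm25f_score(query_terms, doc_fields, corpus_fields, field_weights,
                field_b, k1=1.2):
    """Score one document against a query with a simple BM25F variant.

    doc_fields:    {field: list of tokens} for the document being scored
    corpus_fields: the same dict for every document in the collection
                   (used for average field lengths and document frequencies)
    field_weights: {field: boost} applied to that field's term frequencies
    field_b:       {field: b} length-normalisation strength per field
    """
    n_docs = len(corpus_fields)
    avg_len = {f: sum(len(d.get(f, [])) for d in corpus_fields) / n_docs
               for f in field_weights}

    def doc_freq(term):
        # Number of documents containing the term in any field.
        return sum(any(term in d.get(f, []) for f in field_weights)
                   for d in corpus_fields)

    score = 0.0
    for term in query_terms:
        # Length-normalised term frequency, weighted and summed over fields.
        pseudo_tf = 0.0
        for f, w in field_weights.items():
            tokens = doc_fields.get(f, [])
            tf = tokens.count(term)
            if tf == 0 or avg_len[f] == 0:
                continue
            norm = 1.0 + field_b[f] * (len(tokens) / avg_len[f] - 1.0)
            pseudo_tf += w * tf / norm
        if pseudo_tf == 0.0:
            continue
        df = doc_freq(term)
        idf = math.log((n_docs - df + 0.5) / (df + 0.5) + 1.0)
        # A single BM25 saturation is applied to the combined frequency.
        score += idf * pseudo_tf / (k1 + pseudo_tf)
    return score

# Illustrative fielded documents: an ASR transcript field plus a metadata field.
docs = [
    {"asr": "we talked for hours about the long journey north".split(),
     "keywords": "journey migration family".split()},
    {"asr": "the interview covers early school years in the village".split(),
     "keywords": "school childhood village".split()},
]
weights = {"asr": 1.0, "keywords": 3.0}   # metadata boosted over noisy ASR output
b_values = {"asr": 0.75, "keywords": 0.5}
print([round(bm25f_score("journey village".split(), d, docs, weights, b_values), 3)
       for d in docs])
```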

    Gamesourcing Mismatched Transcription

    Transcribed speech is an essential resource for developing speech technologies for the languages of the world. However, for most of these languages, native speakers may not be readily available online to provide transcriptions. The goal of this research is to explore the possibility of acquiring transcriptions for speech data from non-native speakers of a language, referred to as mismatched transcriptions. The two main problems tackled in this work are: 1) How do we motivate non-native speakers to provide transcriptions? 2) How do we refine the mismatched transcriptions? First, we design a novel game that facilitates the collection of mismatched transcriptions from non-native speakers. In this game, players are prompted to listen to sound clips in a foreign language and asked to transcribe the sounds they hear, to the best of their abilities, using English text. The misperceptions of the non-native speakers are modeled as a finite memory process and implemented using finite state machines. The mismatched transcriptions are further refined using a series of finite-state operations. The main contributions of this thesis are as follows: 1) Creation of a streamlined game for crowdsourcing transcriptions of speech data from non-native speakers. 2) Algorithms that process the resulting mismatched transcriptions and provide the closest-sounding English words. 3) Experiments describing various modifications to the above-mentioned algorithms and results showing their effect on the accuracy of the English words that are produced as output.
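    The following sketch illustrates the noisy-channel intuition behind refining mismatched transcriptions, reduced to a zero-memory special case: a hypothetical confusion-cost table drives a weighted edit distance that ranks candidate English words by how plausibly they could have been misheard. The thesis itself models misperception as a finite-memory process with finite state machines; the costs and lexicon below are made up.

```python
from functools import lru_cache

# Hypothetical confusion costs between perceived and intended symbols; the
# actual work learns misperception patterns from data and encodes them in
# finite state machines rather than a hand-written table.
CONFUSION_COST = {("p", "b"): 0.4, ("t", "d"): 0.4, ("i", "e"): 0.6}

def substitution_cost(a, b):
    if a == b:
        return 0.0
    return CONFUSION_COST.get((a, b), CONFUSION_COST.get((b, a), 1.0))

def channel_distance(perceived, intended, ins_del_cost=1.0):
    """Weighted edit distance: a zero-memory stand-in for the misperception channel."""
    @lru_cache(maxsize=None)
    def d(i, j):
        if i == 0:
            return j * ins_del_cost
        if j == 0:
            return i * ins_del_cost
        return min(
            d(i - 1, j - 1) + substitution_cost(perceived[i - 1], intended[j - 1]),
            d(i - 1, j) + ins_del_cost,   # a perceived symbol with no counterpart
            d(i, j - 1) + ins_del_cost,   # an intended symbol that was not heard
        )
    return d(len(perceived), len(intended))

def closest_words(mismatched, lexicon, n_best=3):
    """Rank candidate English words by how cheaply the channel explains them."""
    return sorted(lexicon, key=lambda w: channel_distance(mismatched, w))[:n_best]

# A listener wrote down "bata"; made-up candidate English words are ranked.
print(closest_words("bata", ["pat", "bat", "better", "data"]))
```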

    A Very Low Resource Language Speech Corpus for Computational Language Documentation Experiments

    Most speech and language technologies are trained with massive amounts of speech and text data. However, most of the world's languages do not have such resources or a stable orthography. Systems constructed under these almost-zero-resource conditions are promising not only for speech technology but also for computational language documentation. The goal of computational language documentation is to help field linguists (semi-)automatically analyze and annotate audio recordings of endangered and unwritten languages. Example tasks are automatic phoneme discovery or lexicon discovery from the speech signal. This paper presents a speech corpus collected during a realistic language documentation process. It is made up of 5k speech utterances in Mboshi (Bantu C25) aligned to French text translations. Speech transcriptions are also made available: they correspond to a non-standard graphemic form close to the language's phonology. We describe how the data was collected, cleaned and processed, and we illustrate its use through a zero-resource task: spoken term discovery. The dataset is made available to the community for reproducible computational language documentation experiments and their evaluation. (Accepted to LREC 2018.)

    Topic Modeling for Automatic Analysis of Natural Language: A Case Study in an Italian Customer Support Center

    This paper focuses on the automatic analysis of conversation transcriptions in the call center of a customer care service. The goal is to recognize topics related to problems and complaints discussed in dialogues between customers and agents. Our study aims to implement a framework able to automatically cluster conversation transcriptions into cohesive and well-separated groups based on the content of the data. The framework also relieves the analyst of selecting proper parameter values for the analysis and clustering processes. To pursue this goal, we consider a probabilistic model based on latent Dirichlet allocation, which associates transcriptions with a mixture of topics in different proportions. The paper considers a case study of transcriptions in Italian, collected in the customer support center of an energy supplier, and compares the performance of different inference techniques on it. The experimental results demonstrate the approach's efficacy in clustering Italian conversation transcriptions, and it yields a practical tool that simplifies the analytic process and offloads parameter tuning from the end user. In the context of recent work in the literature, this paper may be valuable for introducing latent Dirichlet allocation approaches to topic modeling for the Italian language.
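    A minimal sketch of the kind of LDA-based clustering described above, using scikit-learn; the toy transcriptions, topic count and use of an English stop-word list (rather than an Italian one) are assumptions for illustration only.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Hypothetical call-center transcriptions; the study works on Italian text,
# so an Italian stop-word list would replace the English one used here.
transcriptions = [
    "customer reports a wrong amount on the electricity bill",
    "agent explains how to activate the gas supply contract",
    "complaint about a delayed refund on the last invoice",
]

vectorizer = CountVectorizer(stop_words="english", min_df=1)
doc_term = vectorizer.fit_transform(transcriptions)

# Variational Bayes inference; the paper compares several inference techniques.
lda = LatentDirichletAllocation(n_components=2, learning_method="batch",
                                random_state=0)
doc_topics = lda.fit_transform(doc_term)   # per-transcription topic mixtures

terms = vectorizer.get_feature_names_out()
for k, comp in enumerate(lda.components_):
    top_terms = [terms[i] for i in comp.argsort()[::-1][:5]]
    print(f"topic {k}: {top_terms}")
```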

    A Multi-modal Approach to Fine-grained Opinion Mining on Video Reviews

    Despite recent advances in opinion mining for written reviews, few works have tackled the problem for other sources of reviews. In light of this issue, we propose a multi-modal approach for mining fine-grained opinions from video reviews that is able to determine the aspects of the item under review that are being discussed and the sentiment orientation towards them. Our approach works at the sentence level without the need for time annotations and uses features derived from the audio, video and language transcriptions of the review's contents. We evaluate our approach on two datasets and show that leveraging the video and audio modalities consistently provides increased performance over text-only baselines, providing evidence that these extra modalities are key to better understanding video reviews. (Second Grand Challenge and Workshop on Multimodal Language, ACL 2020.)
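    A minimal sketch of sentence-level multi-modal fusion in PyTorch: pre-extracted text, audio and video features are concatenated and fed to two heads that predict the discussed aspect and the sentiment towards it. All dimensions, class counts and the fusion architecture are illustrative assumptions, not the paper's model.

```python
import torch
import torch.nn as nn

class SentenceFusionClassifier(nn.Module):
    """Late fusion of per-sentence text, audio and video features with joint
    aspect and sentiment prediction heads (sizes are illustrative)."""

    def __init__(self, text_dim=300, audio_dim=74, video_dim=35,
                 n_aspects=8, n_sentiments=3, hidden=128):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Linear(text_dim + audio_dim + video_dim, hidden),
            nn.ReLU(),
            nn.Dropout(0.2),
        )
        self.aspect_head = nn.Linear(hidden, n_aspects)
        self.sentiment_head = nn.Linear(hidden, n_sentiments)

    def forward(self, text_feat, audio_feat, video_feat):
        # Concatenate the three modality feature vectors for each sentence.
        h = self.fuse(torch.cat([text_feat, audio_feat, video_feat], dim=-1))
        return self.aspect_head(h), self.sentiment_head(h)

# One batch of 4 sentences with (randomly generated) pre-extracted features.
model = SentenceFusionClassifier()
aspect_logits, sentiment_logits = model(
    torch.randn(4, 300), torch.randn(4, 74), torch.randn(4, 35))
print(aspect_logits.shape, sentiment_logits.shape)
```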

    Blue: A Stylistic Analysis Of Blue Mitchell

    This study examined ten transcriptions of improvised solos by trumpeter Blue Mitchell. Mitchell was a prominent performer in the hard bop style, known for his tenure with the Horace Silver Quintet. The transcriptions were selected from recordings produced between 1958 and 1977. These pieces include “Sir John,” “Peace,” “Strollin’,” “Why Do I Love You?,” “Chick’s Tune,” “Gingerbread Boy,” “I Love You,” “Quiet Riot,” “Silver Blue,” and “OW!” The chosen transcriptions were analyzed and compared for common melodic devices such as motivic development, use of sequence, blues language, bebop language, and melodic quotes from other songs, all common elements of jazz vocabulary. This study illustrates the elements that Mitchell chose to use and, more importantly, how he chose to use them. The overall structure of Mitchell’s solos was also examined. Mitchell was a very lyrical musician, and his solos tended to be symmetrical in construction.

    VGM-RNN: Recurrent Neural Networks for Video Game Music Generation

    The recent explosion of interest in deep neural networks has affected, and in some cases reinvigorated, work in fields as diverse as natural language processing, image recognition and speech recognition, among others. For sequence learning tasks, recurrent neural networks, and in particular LSTM-based networks, have shown promising results. Recently there has been interest (for example, in the research by Google’s Magenta team) in applying so-called “language modeling” recurrent neural networks to musical tasks, including the automatic generation of original music. In this work we demonstrate our own LSTM-based music language modeling recurrent network. We show that it is able to learn musical features from a MIDI dataset and generate output that is musically interesting while demonstrating features of melody, harmony and rhythm. We source our dataset from VGMusic.com, a collection of user-submitted MIDI transcriptions of video game songs, and attempt to generate output which emulates this kind of music.
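    The sketch below shows a generic LSTM-based "language model" over discretised MIDI event tokens together with temperature sampling, in the spirit of the system described; the vocabulary size, layer sizes and token encoding are illustrative assumptions rather than the VGM-RNN configuration.

```python
import torch
import torch.nn as nn

class MidiLSTM(nn.Module):
    """Next-event prediction over a vocabulary of discretised MIDI events
    (vocabulary and layer sizes are illustrative)."""

    def __init__(self, vocab_size=388, embed=256, hidden=512, layers=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed)
        self.lstm = nn.LSTM(embed, hidden, num_layers=layers, batch_first=True)
        self.out = nn.Linear(hidden, vocab_size)

    def forward(self, tokens, state=None):
        h, state = self.lstm(self.embed(tokens), state)
        return self.out(h), state

def sample(model, seed, steps=64, temperature=1.0):
    """Generate new events by repeatedly sampling from the model's output."""
    model.eval()
    tokens, state, generated = seed, None, []
    with torch.no_grad():
        for _ in range(steps):
            logits, state = model(tokens, state)
            probs = torch.softmax(logits[:, -1] / temperature, dim=-1)
            next_token = torch.multinomial(probs, 1)   # shape (1, 1)
            generated.append(next_token.item())
            tokens = next_token
    return generated

# Untrained model, seeded with a single (made-up) event token.
model = MidiLSTM()
print(sample(model, torch.tensor([[60]]), steps=8))
```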

    Language independent and unsupervised acoustic models for speech recognition and keyword spotting

    Copyright © 2014 ISCA. Developing high-performance speech processing systems for low-resource languages is very challenging. One approach to addressing the lack of resources is to make use of data from multiple languages. A popular direction in recent years is to train a multi-language bottleneck DNN. Language-dependent and/or multi-language (all training languages) Tandem acoustic models (AMs) are then trained. This work considers a particular scenario where the target language is unseen in multi-language training and has limited language model training data, a limited lexicon, and acoustic training data without transcriptions. A zero acoustic resources case is first described, where a multi-language AM is applied directly to an unseen language as a language-independent AM (LIAM). Secondly, in an unsupervised approach, a LIAM is used to obtain hypothesised transcriptions for the target-language acoustic data, which are then used to train a language-dependent AM. Three languages from the IARPA Babel project are used for assessment: Vietnamese, Haitian Creole and Bengali. Performance of the zero acoustic resources system is found to be poor, with keyword spotting at best 60% of language-dependent performance. Unsupervised language-dependent training yields performance gains. For one language (Haitian Creole) the Babel target is achieved on the in-vocabulary data.
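    The control flow of the unsupervised step can be summarised as below. The two callables stand in for whatever ASR toolkit performs decoding and acoustic model training; only the loop structure reflects the description above, and the helper names are hypothetical.

```python
from typing import Callable, List, Sequence, Tuple

def unsupervised_language_dependent_training(
    decode_with_liam: Callable[[object], str],
    train_acoustic_model: Callable[[Sequence[Tuple[object, str]]], object],
    target_audio: List[object],
) -> object:
    """Decode untranscribed target-language audio with the language-independent
    AM (LIAM), then treat the hypotheses as transcriptions when training a
    language-dependent AM. Both callables are toolkit-specific placeholders."""
    # 1. Obtain hypothesised transcriptions from the LIAM.
    hypotheses = [decode_with_liam(utterance) for utterance in target_audio]
    # 2. Pair each utterance with its hypothesis and retrain.
    pseudo_labelled = list(zip(target_audio, hypotheses))
    return train_acoustic_model(pseudo_labelled)
```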

    Sound comparisons: a new online database and resource for research in phonetic diversity

    Sound Comparisons hosts over 90,000 individual word recordings and 50,000 narrow phonetic transcriptions from 600 language varieties in eleven language families around the world. This resource is designed to serve researchers in phonetics, phonology and related fields. Transcriptions follow new initiatives for standardisation in usage of the IPA and Unicode. At soundcomparisons.com, users can explore the transcription datasets through phonetically informed search and filtering, customise selections of languages and words, download any targeted data subset (sound files and transcriptions) and cite it through a custom URL. We present sample research applications based on our extensive coverage of regional and sociolinguistic variation within major languages, and also of endangered languages, for which Sound Comparisons provides a rapid first documentation of their diversity in phonetics. The multilingual interface and user-friendly ‘hover-to-hear’ maps likewise constitute an outreach tool, through which speakers can instantaneously hear and compare the phonetic diversity and relationships of their native languages.

    Identifying Speakers and Limiting Displayed Transcription to Select Speakers

    Machine-generated speech transcriptions help people who are hard of hearing, or who do not understand the spoken language, to follow conversations that take place near them. Augmented reality (AR) glasses can display speech transcriptions to help users understand such conversations. However, the displayed transcriptions can be distracting when the user wearing the AR glasses is herself speaking. This disclosure describes techniques to automatically identify speakers in a conversation and to display live transcriptions only of speech from conversation participants other than the person to whom the transcription is being provided. Speakers may be identified using user-permitted factors, e.g., by use of a head-related transfer function (HRTF), biometric voice recognition, or facial feature (e.g., lip movement) recognition.
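    A small sketch of the filtering step described in this disclosure: given transcript segments already labelled with a speaker identity (obtained upstream via user-permitted HRTF, voice-biometric or lip-movement cues), only segments not spoken by the wearer are passed on for display. The names and data structures are illustrative.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class TranscribedSegment:
    speaker_id: str   # produced by user-permitted speaker identification
    text: str

def segments_to_display(segments: List[TranscribedSegment], wearer_id: str) -> List[str]:
    """Drop the wearer's own speech so only other participants' words are shown."""
    return [seg.text for seg in segments if seg.speaker_id != wearer_id]

segments = [
    TranscribedSegment("wearer", "I think we should leave at noon."),
    TranscribedSegment("guest_1", "Noon works for me."),
]
print(segments_to_display(segments, wearer_id="wearer"))  # ['Noon works for me.']
```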