Examining the contributions of automatic speech transcriptions and metadata sources for searching spontaneous conversational speech
Searching spontaneous speech can be enhanced by combining automatic speech transcriptions with semantically
related metadata. An important question is what can be expected from search of such transcriptions and different
sources of related metadata in terms of retrieval effectiveness. The Cross-Language Speech Retrieval (CL-SR) track at recent CLEF workshops provides a spontaneous speech
test collection with manual and automatically derived metadata fields. Using this collection we investigate the comparative search effectiveness of individual fields comprising automated transcriptions and the available metadata. A further important question is how transcriptions and metadata should be combined for the greatest benefit to search accuracy. We compare simple merging of individual fields with the extended BM25 model for weighted field combination (BM25F). Results indicate that BM25F can produce improved search accuracy, but that its parameters must currently be tuned on a suitable training set.
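The weighted field combination that BM25F performs can be illustrated with a minimal sketch. The field names, weights, and single-query-term setup below are illustrative assumptions, not the CL-SR track's actual configuration: per-field term frequencies are length-normalised, scaled by per-field weights, pooled into one pseudo-frequency, and only then passed through the usual BM25 saturation and IDF.

```python
import math

def bm25f_term_score(field_tfs, field_lens, avg_field_lens, field_weights,
                     doc_freq, num_docs, k1=1.2, field_b=None):
    """Score one query term for one document under a simplified BM25F.

    Per-field term frequencies are length-normalised and weighted, then
    pooled before applying BM25 saturation and IDF once.
    """
    if field_b is None:
        field_b = {f: 0.75 for f in field_tfs}  # per-field length normalisation
    pooled_tf = 0.0
    for f, tf in field_tfs.items():
        norm = 1.0 + field_b[f] * (field_lens[f] / avg_field_lens[f] - 1.0)
        pooled_tf += field_weights[f] * tf / norm
    idf = math.log((num_docs - doc_freq + 0.5) / (doc_freq + 0.5) + 1.0)
    return idf * pooled_tf / (k1 + pooled_tf)

# Toy example: a term occurring in a noisy ASR transcript field and a
# (hypothetical) manual-keyword metadata field of the same document.
score = bm25f_term_score(
    field_tfs={"asr": 3, "keywords": 1},
    field_lens={"asr": 500, "keywords": 10},
    avg_field_lens={"asr": 450, "keywords": 12},
    field_weights={"asr": 1.0, "keywords": 3.0},
    doc_freq=50, num_docs=10000)
```

Raising the weight on the metadata field makes its typically cleaner evidence count for more than noisy transcript matches; choosing such weights well is exactly the parameter-tuning question the abstract raises.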
Gamesourcing Mismatched Transcription
Transcribed speech is an essential resource to develop speech technologies
for different languages of the world. However, native speakers
of most languages of the world may not be readily available online
to acquire transcribed speech. The goal of this research is to explore
the possibility of acquiring transcriptions for speech data from
non-native speakers of a language, referred to as mismatched transcriptions.
The two main problems tackled in this work are: 1) How
do we motivate non-native speakers to provide transcriptions? 2)
How do we refine the mismatched transcriptions? Firstly, we design
a novel game that facilitates the collection of mismatched transcriptions
from non-native speakers. In this game, players are prompted
to listen to sound clips in a foreign language and asked to transcribe
the sounds they hear to the best of their abilities using English text.
The misperceptions by the non-native speakers are modeled as a finite
memory process and implemented using finite state machines.
The mismatched transcriptions are further refined using a series of
finite-state operations.
The main contributions of this thesis are as follows: 1) Creation
of a streamlined game for crowdsourcing transcriptions for speech
data from non-native speakers. 2) Algorithms that process the resulting
mismatched transcriptions and provide the closest sounding
English words. 3) Experiments describing various modifications to
the above-mentioned algorithms and results showing their effect on
the accuracy of the English words that are produced as output.
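The thesis refines mismatched transcriptions with finite-state operations; a much simpler stand-in for the "closest sounding English words" step is a plain edit-distance search over a candidate vocabulary. The snippet below is only that simplified approximation, and the tiny vocabulary and the mismatched transcription "wadder" are invented for illustration.

```python
def edit_distance(a, b):
    """Classic Levenshtein distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def closest_word(transcription, vocabulary):
    """Return the vocabulary word nearest to a mismatched transcription."""
    return min(vocabulary, key=lambda w: edit_distance(transcription, w))

# Hypothetical mismatched transcription of a foreign-language sound clip.
vocab = ["mother", "water", "weather", "matter"]
print(closest_word("wadder", vocab))
```

A finite-state implementation, as used in the thesis, generalises this by letting the misperception model assign different costs to different substitutions rather than a flat cost of one.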
A Very Low Resource Language Speech Corpus for Computational Language Documentation Experiments
Most speech and language technologies are trained with massive amounts of
speech and text information. However, most of the world languages do not have
such resources or stable orthography. Systems constructed under these almost
zero resource conditions are not only promising for speech technology but also
for computational language documentation. The goal of computational language
documentation is to help field linguists to (semi-)automatically analyze and
annotate audio recordings of endangered and unwritten languages. Example tasks
are automatic phoneme discovery or lexicon discovery from the speech signal.
This paper presents a speech corpus collected during a realistic language
documentation process. It is made up of 5k speech utterances in Mboshi (Bantu
C25) aligned to French text translations. Speech transcriptions are also made
available: they correspond to a non-standard graphemic form close to the
language phonology. We present how the data was collected, cleaned and
processed and we illustrate its use through a zero-resource task: spoken term
discovery. The dataset is made available to the community for reproducible
computational language documentation experiments and their evaluation. Comment: accepted to LREC 2018
Topic Modeling for Automatic Analysis of Natural Language: A Case Study in an Italian Customer Support Center
This paper focuses on the automatic analysis of conversation transcriptions in the call center of a customer care service. The goal is to recognize topics related to problems and complaints discussed in several dialogues between customers and agents. Our study aims to implement a framework able to automatically cluster conversation transcriptions into cohesive and well-separated groups based on the content of the data. The framework relieves the analyst of selecting proper values for the analysis and the clustering processes. To pursue this goal, we consider a probabilistic model based on the latent Dirichlet allocation, which associates transcriptions with a mixture of topics in different proportions. A case study consisting of transcriptions in the Italian natural language, collected in a customer support center of an energy supplier, is considered in the paper. Performance comparison of different inference techniques is discussed using the case study. The experimental results demonstrate the approach's efficacy in clustering Italian conversation transcriptions. It also results in a practical tool to simplify the analytic process and off-load the parameter tuning from the end-user. According to recent works in the literature, this paper may be valuable for introducing latent Dirichlet allocation approaches to topic modeling for the Italian natural language.
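The core of such an LDA-based clustering pipeline can be sketched with scikit-learn. The toy "transcriptions" and the topic count below are made up for illustration; the paper works on Italian call-center data and compares several inference techniques, which this sketch does not reproduce.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Toy stand-ins for call-center conversation transcriptions.
transcriptions = [
    "billing invoice payment charge overdue invoice",
    "payment charge billing refund invoice",
    "outage power failure blackout restore power",
    "blackout outage restore failure power line",
]

# Bag-of-words representation of each transcription.
vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(transcriptions)

# Associate each transcription with a mixture of latent topics.
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(counts)

# Cluster each conversation under its dominant topic.
clusters = doc_topics.argmax(axis=1)
```

Choosing the number of topics and the inference method is precisely the parameter tuning the paper's framework tries to off-load from the analyst.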
A Multi-modal Approach to Fine-grained Opinion Mining on Video Reviews
Despite the recent advances in opinion mining for written reviews, few works
have tackled the problem on other sources of reviews. In light of this issue,
we propose a multi-modal approach for mining fine-grained opinions from video
reviews that is able to determine the aspects of the item under review that are
being discussed and the sentiment orientation towards them. Our approach works
at the sentence level without the need for time annotations and uses features
derived from the audio, video and language transcriptions of its contents. We
evaluate our approach on two datasets and show that leveraging the video and
audio modalities consistently provides increased performance over text-only
baselines, providing evidence these extra modalities are key in better
understanding video reviews. Comment: Second Grand Challenge and Workshop on Multimodal Language, ACL 2020
Blue: A Stylistic Analysis Of Blue Mitchell
This study examined ten transcriptions of improvised solos by trumpeter Blue Mitchell. Mitchell was a prominent performer in the hard bop style known for his tenure with the Horace Silver Quintet. The transcriptions were selected from recordings produced between 1958 and 1977. These pieces include "Sir John," "Peace," "Strollin'," "Why Do I Love You?," "Chick's Tune," "Gingerbread Boy," "I Love You," "Quiet Riot," "Silver Blue," and "OW!" The chosen transcriptions were analyzed and compared for common melodic devices such as motivic development, use of sequence, blues language, bebop language, and melodic quotes from other songs, which are all common elements of jazz vocabulary. This study illustrates the elements that Mitchell chose to use and, more importantly, how he chose to use them. The overall structure of Mitchell's solos was also examined. Mitchell was a very lyrical musician and his solos tended to be symmetrical in construction.
VGM-RNN: Recurrent Neural Networks for Video Game Music Generation
The recent explosion of interest in deep neural networks has affected, and in some cases reinvigorated, work in fields as diverse as natural language processing, image recognition, speech recognition and many more. For sequence learning tasks, recurrent neural networks, and in particular LSTM-based networks, have shown promising results. Recently there has been interest, for example in the research by Google's Magenta team, in applying so-called "language modeling" recurrent neural networks to musical tasks, including the automatic generation of original music. In this work we demonstrate our own LSTM-based music language modeling recurrent network. We show that it is able to learn musical features from a MIDI dataset and generate output that is musically interesting while demonstrating features of melody, harmony and rhythm. We source our dataset from VGMusic.com, a collection of user-submitted MIDI transcriptions of video game songs, and attempt to generate output which emulates this kind of music.
Language independent and unsupervised acoustic models for speech recognition and keyword spotting
Copyright © 2014 ISCA. Developing high-performance speech processing systems for low-resource languages is very challenging. One approach to address the lack of resources is to make use of data from multiple languages. A popular direction in recent years is to train a multi-language bottleneck DNN. Language dependent and/or multi-language (all training languages) Tandem acoustic models (AM) are then trained. This work considers a particular scenario where the target language is unseen in multi-language training and has limited language model training data, a limited lexicon, and acoustic training data without transcriptions. A zero acoustic resources case is first described where a multilanguage AM is directly applied, as a language independent AM (LIAM), to an unseen language. Secondly, in an unsupervised approach a LIAM is used to obtain hypotheses for the target language acoustic data transcriptions which are then used in training a language dependent AM. 3 languages from the IARPA Babel project are used for assessment: Vietnamese, Haitian Creole and Bengali. Performance of the zero acoustic resources system is found to be poor, with keyword spotting at best 60% of language dependent performance. Unsupervised language dependent training yields performance gains. For one language (Haitian Creole) the Babel target is achieved on the in-vocabulary data
Sound comparisons: a new online database and resource for research in phonetic diversity
Sound Comparisons hosts over 90,000 individual word recordings and 50,000 narrow phonetic transcriptions from 600 language varieties from eleven language families around the world. This resource is designed to serve researchers in phonetics, phonology and related fields. Transcriptions follow new initiatives for standardisation in usage of the IPA and Unicode. At soundcomparisons.com, users can explore the transcription datasets by phonetically-informed search and filtering, customise selections of languages and words, download any targeted data subset (sound files and transcriptions) and cite it through a custom URL. We present sample research applications based on our extensive coverage of regional and sociolinguistic variation within major languages, and also of endangered languages, for which Sound Comparisons provides a rapid first documentation of their diversity in phonetics. The multilingual interface and user-friendly, "hover-to-hear" maps likewise constitute an outreach tool, where speakers can instantaneously hear and compare the phonetic diversity and relationships of their native languages.
Identifying Speakers and Limiting Displayed Transcription to Select Speakers
Machine-generated speech transcriptions help people who are hard of hearing, or individuals who don't understand the spoken language, to follow conversations that take place near them. Augmented reality glasses can display speech transcriptions to help users understand such conversations. However, the display of transcriptions can be distracting when the user wearing the AR glasses is herself speaking. This disclosure describes techniques to automatically identify speakers in a conversation and only display live transcriptions of speech from conversation participants other than the person to whom the transcription is being provided. Speakers may be identified using user-permitted factors, e.g., by use of a head-related transfer function (HRTF), biometric voice recognition, or facial feature (e.g., lip movement) recognition.