
    RepCodec: A Speech Representation Codec for Speech Tokenization

    With the recent rapid growth of large language models (LLMs), discrete speech tokenization has played an important role in injecting speech into LLMs. However, this discretization causes a loss of information, which impairs overall performance. To improve the quality of these discrete speech tokens, we present RepCodec, a novel speech representation codec for semantic speech tokenization. In contrast to audio codecs, which reconstruct the raw audio, RepCodec learns a vector quantization codebook by reconstructing speech representations from speech encoders such as HuBERT or data2vec. Together, the speech encoder, the codec encoder and the vector quantization codebook form a pipeline for converting speech waveforms into semantic tokens. Extensive experiments show that RepCodec, by virtue of its enhanced information retention, significantly outperforms the widely used k-means clustering approach in both speech understanding and generation, and that this advantage holds across various speech encoders and languages, affirming the robustness of RepCodec. We believe our method can facilitate large language modeling research on speech processing.
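    The abstract describes a three-stage pipeline (speech encoder, codec encoder, vector quantization codebook). A minimal sketch of that tokenization flow is given below; the layer sizes, codebook size and module names are illustrative assumptions rather than the paper's actual configuration, and the speech-encoder output is replaced by a random tensor.

        import torch
        import torch.nn as nn

        class CodecEncoder(nn.Module):
            # Maps frame-level speech representations into the quantization space.
            def __init__(self, dim_in=768, dim_latent=256):
                super().__init__()
                self.net = nn.Sequential(nn.Linear(dim_in, dim_latent), nn.ReLU(),
                                         nn.Linear(dim_latent, dim_latent))

            def forward(self, x):
                return self.net(x)

        class VectorQuantizer(nn.Module):
            # Nearest-neighbour lookup into a learned codebook, yielding discrete token ids.
            def __init__(self, num_codes=1024, dim=256):
                super().__init__()
                self.codebook = nn.Parameter(torch.randn(num_codes, dim))

            def forward(self, z):
                dists = torch.cdist(z, self.codebook)    # (T, num_codes)
                return dists.argmin(dim=-1)              # (T,) semantic token ids

        # Stand-in for HuBERT/data2vec output: 100 frames of 768-dim representations.
        frame_repr = torch.randn(100, 768)
        tokens = VectorQuantizer()(CodecEncoder()(frame_repr))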

    A Study of Modeling Rising Intonation in Cantonese Neural Speech Synthesis

    In human speech, a speaker's attitude cannot be fully expressed by the textual content alone; it also has to be conveyed through intonation. Declarative questions are common in daily Cantonese conversations, and they are usually uttered with rising intonation. Vanilla neural text-to-speech (TTS) systems cannot synthesize rising intonation for these sentences because of the loss of this semantic information. Although it has become more common to complement such systems with extra language models, their performance in modeling rising intonation is not well studied. In this paper, we propose to complement the Cantonese TTS model with a BERT-based statement/question classifier. We design different training strategies and compare their performance. We conduct our experiments on a Cantonese corpus named CanTTS. Empirical results show that the separate training approach obtains the best generalization performance and feasibility. Comment: Accepted by INTERSPEECH 202
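    As a rough illustration of the architecture this abstract outlines, the sketch below places a BERT-based statement/question classifier in front of a TTS system; the bert-base-chinese checkpoint and the way the predicted flag is consumed by the synthesizer are assumptions for illustration only.

        import torch
        from transformers import BertTokenizer, BertForSequenceClassification

        # Assumed checkpoint; the paper's classifier would be fine-tuned on Cantonese data.
        tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
        classifier = BertForSequenceClassification.from_pretrained(
            "bert-base-chinese", num_labels=2)   # 0 = statement, 1 = question

        def intonation_flag(text: str) -> int:
            # Predict whether the sentence should be rendered with rising intonation.
            inputs = tokenizer(text, return_tensors="pt")
            with torch.no_grad():
                logits = classifier(**inputs).logits
            return int(logits.argmax(dim=-1))

        # The flag would then condition the TTS front end, e.g.
        # synthesize(text, rising=intonation_flag(text))   # hypothetical TTS call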

    CoBERT: Self-Supervised Speech Representation Learning Through Code Representation Learning

    Speech is the surface form of a finite set of phonetic units, which can be represented by discrete codes. We propose the Code BERT (CoBERT) approach for self-supervised speech representation learning. The idea is to convert an utterance into a sequence of discrete codes and perform code representation learning, in which we predict the code representations based on a masked view of the original speech input. Unlike prior self-distillation approaches, where the teacher and the student share the same modality, our target model predicts representations from a different modality. CoBERT outperforms the most recent state of the art on the ASR task and brings significant improvements on the SUPERB speech translation (ST) task. Comment: Submitted to ICASSP 202
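    A minimal sketch of masked code-representation prediction in the spirit of this description is shown below; the code vocabulary, masking rate, network sizes and the random tensor standing in for the teacher's targets are illustrative assumptions rather than the paper's setup.

        import torch
        import torch.nn as nn

        vocab, dim, T = 512, 256, 50               # code vocabulary, hidden size, frames
        codes = torch.randint(1, vocab, (T,))      # discrete codes for one utterance

        student = nn.Sequential(
            nn.Embedding(vocab, dim),
            nn.TransformerEncoder(
                nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True), num_layers=2))

        mask = torch.rand(T) < 0.15                # choose positions to mask
        masked = codes.clone()
        masked[mask] = 0                           # id 0 reused as a [MASK] token here

        with torch.no_grad():
            teacher_targets = torch.randn(T, dim)  # stand-in for the code teacher's outputs

        pred = student(masked.unsqueeze(0)).squeeze(0)   # (T, dim)
        loss = nn.functional.mse_loss(pred[mask], teacher_targets[mask])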

    Eigentrigraphemes for under-resourced languages

    Grapheme-based modeling has an advantage over phone-based modeling in automatic speech recognition for under-resourced languages when a good dictionary is not available. We recently proposed a new method for parameter estimation of context-dependent hidden Markov models (HMMs) called eigentriphone modeling. Eigentriphone modeling outperforms conventional tied-state HMMs by eliminating the quantization errors among the tied states. The eigentriphone modeling framework is very flexible and can be applied to any group of modeling units, provided that they can be represented by vectors of the same dimension. In this paper, we port the eigentriphone modeling method from a phone-based system to a grapheme-based system; the new method is called eigentrigrapheme modeling. Experiments on four official South African under-resourced languages (Afrikaans, South African English, Sesotho, siSwati) show that the new eigentrigrapheme modeling method reduces the word error rates of conventional tied-state trigrapheme modeling by an average of 4.08% relative.
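    The abstract's requirement that every modeling unit be representable as a fixed-dimension vector suggests an eigen-basis construction over the unit parameter vectors. The numpy sketch below illustrates that idea in the abstract's own terms (principal directions of the unit vectors plus per-unit coefficients); the dimensions and the plain PCA used here are illustrative assumptions, not the paper's actual estimation procedure.

        import numpy as np

        n_units, dim, n_eigen = 200, 39, 10
        unit_vectors = np.random.randn(n_units, dim)   # one parameter vector per trigrapheme

        mean = unit_vectors.mean(axis=0)
        centered = unit_vectors - mean
        _, _, vt = np.linalg.svd(centered, full_matrices=False)
        basis = vt[:n_eigen]                           # principal directions ("eigentrigraphemes")

        # Each unit is re-expressed as the mean plus a few eigen-coefficients,
        # which smooths the estimates of poorly observed units.
        coeffs = centered @ basis.T                    # (n_units, n_eigen)
        reconstructed = mean + coeffs @ basis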

    Leveraging per Image-Token Consistency for Vision-Language Pre-training

    Most existing vision-language pre-training (VLP) approaches adopt cross-modal masked language modeling (CMLM) to learn vision-language associations. However, we find that CMLM is insufficient for this purpose, based on two observations: (1) Modality bias: a considerable number of masked tokens in CMLM can be recovered from the language information alone, ignoring the visual input. (2) Under-utilization of the unmasked tokens: CMLM focuses primarily on the masked tokens and cannot simultaneously leverage the other tokens to learn vision-language associations. To address these limitations, we propose EPIC (lEveraging Per Image-Token Consistency for vision-language pre-training). In EPIC, for each image-sentence pair, we mask tokens that are salient to the image (i.e., Saliency-based Masking Strategy), replace them with alternatives sampled from a language model (i.e., Inconsistent Token Generation Procedure), and then require the model to determine, for each token in the sentence, whether it is consistent with the image (i.e., Image-Text Consistent Task). The proposed EPIC method is easily combined with existing pre-training methods. Extensive experiments show that combining EPIC with state-of-the-art pre-training approaches, including ViLT, ALBEF, METER, and X-VLM, leads to significant improvements on downstream tasks.
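    The per-token training signal described above can be illustrated with a small sketch: salient tokens are replaced by language-model samples and each token receives a consistent/inconsistent label. The saliency scores, the replacement dictionary standing in for a language-model sampler, and the threshold are all illustrative assumptions.

        import torch

        tokens   = ["a", "dog", "runs", "on", "grass"]
        saliency = torch.tensor([0.1, 0.9, 0.4, 0.1, 0.8])   # assumed image-conditioned saliency
        replace  = saliency > 0.5                             # replace the most salient tokens

        lm_samples = {"dog": "cat", "grass": "snow"}          # stand-in for LM-sampled alternatives
        corrupted  = [lm_samples.get(t, t) if r else t for t, r in zip(tokens, replace)]

        # Per-token labels for the Image-Text Consistent task: 1 = consistent, 0 = replaced.
        labels = (~replace).long()
        print(corrupted, labels.tolist())   # ['a', 'cat', 'runs', 'on', 'snow'] [1, 0, 1, 1, 0]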

    A new type of slumping-induced soft-sediment deformation structure: the envelope structure

    The sediments of the Cretaceous Gyeokpori Formation in south-western South Korea accumulated in a lake in which mainly siliciclastic rocks were deposited, with some interbedded volcaniclastics. Nearby volcanic activity resulted in unstable lake margins, inducing a dominance of gravity-flow deposits, and the high sedimentation rate facilitated soft-sediment deformation on the sloping margin. The deposition of numerous gravity-flow deposits resulted in a vertically heterolithic stratification. The slumps are composed of different lithologies, which is expressed in different types of deformation owing to the difference in cohesion between sandy and muddy layers within the slumps. Coarser-grained (cohesionless) slumps tend to show more chaotic deformation of their lamination or layering. The difference in slumping behaviour between the cohesive and non-cohesive examples is explained and modelled. A unique soft-sediment deformation structure is recognized. This structure has not been described before, and we call it the ‘envelope structure’. It consists of a conglomerate mass that has become entirely embedded in fine-grained sediment because slope failure took place and the fine-grained material slumped down with the conglomerate ‘at its back’. The cohesive laminated mudstone locally formed slump folds that embedded the non-cohesive overlying conglomerate unit, possibly partly due to the bulldozing effect of the latter. This structure presumably develops when the density contrast with the underlying and overlying deposits is exceptionally high. The envelope structure should be regarded as a special and rare type of slumping-induced deformation structure.

    GigaST: A 10,000-hour Pseudo Speech Translation Corpus

    This paper introduces GigaST, a large-scale pseudo speech translation (ST) corpus. We create the corpus by translating the text in GigaSpeech, an English ASR corpus, into German and Chinese. The training set is translated by a strong machine translation system and the test set is translated by humans. ST models trained with the addition of our corpus obtain new state-of-the-art results on the MuST-C English-German benchmark test set. We provide a detailed description of the translation process and verify its quality. We make the translated text data public and hope to facilitate research in speech translation. We also release the training scripts on NeurST to make it easy to replicate our systems. The GigaST dataset is available at https://st-benchmark.github.io/resources/GigaST. Comment: Submitted to Interspeech 2022.
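    The corpus-construction step described above amounts to machine-translating English ASR transcripts. The sketch below shows that step with an off-the-shelf Marian model from the transformers library; the Helsinki-NLP/opus-mt-en-de checkpoint is only a stand-in for the stronger MT system the authors used, and the input sentence is illustrative.

        from transformers import MarianMTModel, MarianTokenizer

        checkpoint = "Helsinki-NLP/opus-mt-en-de"          # stand-in English-to-German MT model
        tokenizer = MarianTokenizer.from_pretrained(checkpoint)
        model = MarianMTModel.from_pretrained(checkpoint)

        transcripts = ["as a result the agreement was signed the next day"]  # GigaSpeech-style text
        batch = tokenizer(transcripts, return_tensors="pt", padding=True)
        generated = model.generate(**batch)
        translations = tokenizer.batch_decode(generated, skip_special_tokens=True)
        print(translations)   # pseudo-label translations for the ST corpus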

    From overtourism to undertourism... and back? The struggle to manage tourism regrowth in post-pandemic Amsterdam

    Amsterdam is one of many cities that have struggled with problems of overtourism in recent years. These problems include nuisance, crowdedness, rising housing prices and economic dependence on tourism. City administrators were aware of these issues and took a variety of measures before the onset of the COVID-19 pandemic, such as placing restrictions on tourism rentals (Airbnb) and setting up campaigns to tackle problem behaviour of tourists. Yet when COVID-19 halted the stream of tourists visiting Amsterdam, it created a unique opportunity to make more drastic changes, which the administration addressed by proposing a new series of measures to proactively help contain tourism regrowth in the post-pandemic period. In this article we critically analyze these strategies to ascertain the extent to which they appear able to address the various issues of pre-pandemic overtourism at which they are aimed. Our analysis demonstrates that, while trajectories that had already started before the pandemic are now being augmented and given higher priority, and some new strategies to further curb tourism growth are being implemented, Amsterdam overall remains dominated by a growth-oriented approach to tourism planning.