23 research outputs found

    Increase Apparent Public Speaking Fluency By Speech Augmentation

    Full text link
    Fluent and confident speech is desirable for every speaker, but delivering professional speech requires a great deal of experience and practice. In this paper, we propose a speech-stream manipulation system that helps non-professional speakers produce fluent, professional-sounding speech, in turn contributing to better listener engagement and comprehension. We achieve this by manipulating the disfluencies in human speech, such as the filled pauses 'uh' and 'um', filler words, and awkward long silences. Given any unrehearsed speech, we segment and silence the filled pauses and adjust the duration of the imposed silence, as well as of other long ('disfluent') pauses, using a predictive model learned from a dataset of professional speech. Finally, we output an audio stream in which the speaker sounds more fluent, confident and practiced than in the original recording. According to our quantitative evaluation, we significantly increase the fluency of the speech by reducing the rate of pauses and fillers.
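
    As a rough illustration of the manipulation step this abstract describes, the sketch below silences pre-detected filler segments and re-times pauses. It assumes disfluency detection has already produced segment timestamps, and the `target_pause_len` callable is a hypothetical stand-in for the paper's learned predictive model.

```python
# Minimal sketch: silence filled pauses and doctor pause durations.
# Assumes filler/pause segments were detected upstream; `target_pause_len`
# is an illustrative stand-in for the paper's predictive duration model.
import numpy as np

def rewrite_pauses(audio, sr, segments, target_pause_len):
    """audio: 1-D float array; sr: sample rate (Hz);
    segments: list of (start_s, end_s) spans marked as fillers/long pauses;
    target_pause_len: maps an original pause duration (s) to a new one (s)."""
    out, cursor = [], 0
    for start_s, end_s in segments:
        start, end = int(start_s * sr), int(end_s * sr)
        out.append(audio[cursor:start])            # keep fluent speech as-is
        orig_dur = (end - start) / sr
        new_dur = target_pause_len(orig_dur)       # predictive-model stand-in
        out.append(np.zeros(int(new_dur * sr)))    # replace filler with silence
        cursor = end
    out.append(audio[cursor:])
    return np.concatenate(out)

# Example: cap every detected filler/long pause at 0.3 s of silence.
# doctored = rewrite_pauses(y, 16000, segs, lambda d: min(d, 0.3))
```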

    Listening to features

    Get PDF
    This work explores nonparametric methods that aim at synthesizing audio from the low-dimensional acoustic features typically used in MIR frameworks. Several issues prevent this task from being achieved straightforwardly. Such features are designed for analysis rather than synthesis, favoring high-level description over easily inverted acoustic representations. Whereas some previous studies have considered the problem of synthesizing audio from features such as Mel-frequency cepstral coefficients, they mainly relied on the explicit formula used to compute those features in order to invert them. Here, we instead adopt a simple blind approach, where arbitrary sets of features can be used during synthesis and where reconstruction is exemplar-based. After testing the approach on the problem of synthesizing speech from well-known features, we apply it to the more complex task of inverting songs from the Million Song Dataset. This task is harder for two reasons. First, the features are irregularly spaced in the temporal domain according to an onset-based segmentation. Second, the exact method used to compute these features is unknown, although the features for new audio can be computed through their API as a black box. In this paper, we detail these difficulties and present a framework that nonetheless attempts such synthesis by concatenating audio samples from a training dataset whose features have been computed beforehand. Samples are selected at the segment level, in the feature space, with a simple nearest-neighbor search. Additional constraints can then be defined to enhance the pertinence of the synthesis. Preliminary experiments are presented using the RWC and GTZAN audio datasets to synthesize tracks from the Million Song Dataset.
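
    A minimal sketch of the segment-level nearest-neighbor selection described above, using scikit-learn's NearestNeighbors. The variable names and the plain Euclidean metric are assumptions, and the paper's additional constraints are omitted.

```python
# Exemplar-based reconstruction sketch: for each target segment's feature
# vector, pick the nearest training segment and concatenate its audio.
# The L2 metric and data layout are assumptions, not the paper's exact setup.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def invert_by_concatenation(target_feats, train_feats, train_segments):
    """target_feats: (n, d) features of the track to reconstruct;
    train_feats: (m, d) features of training segments;
    train_segments: list of m 1-D audio arrays aligned with train_feats."""
    nn = NearestNeighbors(n_neighbors=1).fit(train_feats)
    _, idx = nn.kneighbors(target_feats)            # nearest exemplar per segment
    return np.concatenate([train_segments[i] for i in idx.ravel()])
```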

    Audio Summarization with Audio Features and Probability Distribution Divergence

    Get PDF
    The automatic summarization of multimedia sources is an important task that helps an individual understand a source by condensing it while maintaining the relevant information. In this paper we focus on audio summarization based on audio features and probability distribution divergence. Our method, based on an extractive summarization approach, aims to select the most relevant segments until a time threshold is reached. It takes into account each segment's length, position and informativeness value. The informativeness of each segment is obtained by mapping a set of audio features derived from its Mel-frequency cepstral coefficients to their corresponding Jensen-Shannon divergence score. Results over a multi-evaluator scheme show that our approach produces understandable and informative summaries.
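
    The following is a hedged sketch of this scoring-and-selection scheme: each segment's MFCC distribution is compared against the whole recording's distribution via SciPy's Jensen-Shannon distance (the square root of the divergence), and segments are kept greedily until the time budget is spent. The histogram binning and the greedy policy are assumptions, and the paper's length and position weights are omitted.

```python
# Extractive audio summarization sketch: score segments by the divergence
# between their MFCC distribution and the global one, then greedily fill a
# time budget. Binning and selection policy are illustrative assumptions.
import numpy as np
from scipy.spatial.distance import jensenshannon

def mfcc_hist(mfcc, bins, rng):
    h, _ = np.histogram(mfcc, bins=bins, range=rng, density=True)
    return h + 1e-12                      # avoid empty bins

def summarize(segments, budget_s, bins=64):
    """segments: list of dicts with 'mfcc' (2-D array) and 'dur' (seconds)."""
    all_mfcc = np.concatenate([s['mfcc'].ravel() for s in segments])
    rng = (all_mfcc.min(), all_mfcc.max())
    ref = mfcc_hist(all_mfcc, bins, rng)  # distribution of the whole recording
    ranked = sorted(
        segments,
        key=lambda s: jensenshannon(mfcc_hist(s['mfcc'].ravel(), bins, rng), ref),
        reverse=True)                     # most "divergent" segments first
    picked, used = [], 0.0
    for s in ranked:
        if used + s['dur'] <= budget_s:   # keep segments until budget is spent
            picked.append(s)
            used += s['dur']
    return picked
```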

    Proposal and Evaluation of a Method for Quickly Selecting Songs from a Large Music Collection on Smart Devices

    Get PDF
    In this study, we propose and implement a system for smart devices that selects songs based on the rhythm and position of taps on the screen, and we experimentally evaluate its selection accuracy and usefulness. In recent years, people increasingly listen to music on music players and smartphones running general-purpose operating systems such as Android and iOS. These devices can store and play thousands of songs. However, as the number of songs on a device grows, selecting a song from the song- or album-selection screens becomes time-consuming. We therefore propose an intuitive song-selection system for mobile devices. We designed and evaluated a system that narrows down the desired song by comparing the rhythm of screen taps and the pitch changes indicated by tap positions against the rhythm and pitch changes of the song's vocals. We built three components: a system that creates a song database (DB) from the acoustic features of songs, a system that takes input from the device's screen, and a system that compares the input tap data with the song data in the DB to select a song. Based on a preliminary experiment, we adopted an input method in which the device's screen is tapped like a piano keyboard, and we evaluated the usefulness of the algorithm. Through input experiments, we evaluated selection accuracy and usefulness. When matching against 100 songs from the RWC Music Database for research, a database built from MIDI data achieved 100% selection success with exact input, while with input from participants the correct song ranked in the top five at best 91.7% of the time. With a DB built from waveform data, the top-five rate with participant input was at best 38.4%. The correctness of the pitch transitions in participants' input was about 77.8%. These results suggest that selection accuracy can be improved by refining the algorithm that compares the input against the DB and by training users.
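
    As an illustration of the comparison step, the sketch below matches the rhythm of a user's taps against stored vocal-onset times using normalized inter-onset intervals. The cost function and data layout are assumptions standing in for the system's actual matching algorithm.

```python
# Tap-rhythm matching sketch: compare normalized inter-onset intervals (IOIs)
# of the user's taps against each song's vocal onsets. The sliding-window L2
# cost is an illustrative stand-in for the system's real comparison.
import numpy as np

def ioi_profile(onsets):
    """Normalized inter-onset intervals: a tempo-invariant rhythm profile."""
    ioi = np.diff(np.asarray(onsets, dtype=float))
    return ioi / ioi.sum()

def rank_songs(tap_times, song_db):
    """tap_times: user's tap timestamps (s);
    song_db: dict mapping song name -> vocal onset times (hypothetical layout).
    Returns song names ordered from best to worst rhythmic match."""
    query = ioi_profile(tap_times)
    n = len(query)
    scores = {}
    for name, onsets in song_db.items():
        ref = ioi_profile(onsets)
        if len(ref) < n:
            scores[name] = np.inf         # song too short to contain the query
            continue
        costs = []
        for i in range(len(ref) - n + 1): # slide query over the song's profile
            w = ref[i:i + n]
            costs.append(np.linalg.norm(query - w / w.sum()))
        scores[name] = min(costs)
    return sorted(scores, key=scores.get)
```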

    Summarizing videos into a target language: Methodology, architectures and evaluation

    Get PDF
    The aim of this work is to report the results of the Chist-Era project AMIS (Access Multilingual Information opinionS). The purpose of AMIS is to answer the following question: how can information in a foreign language be made accessible to everyone? This issue is not limited to translating a source video into a target-language video, since the objective is to convey only the main idea of an Arabic video in English. This objective necessitates research in several areas, not all of which have reached maturity: video summarization, speech recognition, machine translation, audio summarization and speech segmentation. In this article we present several possible architectures for achieving our objective, but focus on only one of them. The scientific challenges are presented, and we explain how we deal with them. One of the big challenges of this work is to devise a way to objectively evaluate a system composed of several components, knowing that each of them has its limits and that errors propagate onward from the first component. A subjective evaluation procedure is also proposed, in which several annotators were mobilized to assess the quality of the produced summaries.
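
    The article discusses a cascade of components; the stubbed pipeline below is only a schematic of one such architecture, assuming the stages run in this order. All function names and bodies here are illustrative placeholders, not the project's code.

```python
# Schematic of one possible AMIS cascade. Each stage is a placeholder stub;
# the ordering illustrates why errors in an early stage (e.g. ASR) propagate
# through every later one, which complicates objective end-to-end evaluation.
def summarize_video(video): ...     # video summarization: keep salient clips
def transcribe(clip): ...           # speech recognition (Arabic audio)
def translate(text): ...            # machine translation (Arabic -> English)
def summarize_text(text): ...       # final summarization of the translation

def amis_pipeline(video):
    clip = summarize_video(video)   # shorten first, so later stages see less input
    transcript = transcribe(clip)   # ASR errors introduced here flow downstream
    english = translate(transcript)
    return summarize_text(english)
```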