191 research outputs found

    SALSA-TEXT: Self-Attentive Latent Space Based Adversarial Text Generation

    Inspired by the success of the self-attention mechanism and the Transformer architecture in sequence transduction and image generation, we propose novel self-attention-based architectures to improve the performance of adversarial latent code-based schemes for text generation. Adversarial latent code-based text generation has recently gained attention due to its promising results. In this paper, we take a step toward fortifying the architectures used in these setups, specifically the two adversarially trained latent code-based methods AAE and ARAE, which we also use as baselines. In our experiments, the Google sentence compression dataset is used to compare our method against these baselines using various objective and subjective measures. The experiments demonstrate that the proposed (self-)attention-based models outperform the state of the art in adversarial code-based text generation.
    Comment: 10 pages, 3 figures, under review at ICLR 2019.
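    The self-attention mechanism these architectures build on can be stated in a few lines. Below is a minimal numpy sketch of single-head scaled dot-product self-attention, the building block the abstract refers to; the dimensions and random projection matrices are illustrative assumptions, not values from the paper.

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Single-head scaled dot-product self-attention over a sequence.

    X          : (seq_len, d_model) input token representations
    Wq, Wk, Wv : (d_model, d_k) projection matrices (learned in practice)
    """
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])           # (seq_len, seq_len)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)    # row-wise softmax
    return weights @ V                                # (seq_len, d_k)

# Toy usage: 5 tokens, model width 8, attention width 4.
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))
Wq, Wk, Wv = (rng.normal(size=(8, 4)) for _ in range(3))
print(self_attention(X, Wq, Wk, Wv).shape)  # (5, 4)
```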

    A Survey of AI Music Generation Tools and Models

    In this work, we provide a comprehensive survey of AI music generation tools, covering both research projects and commercial applications. To conduct our analysis, we classify music generation approaches into three categories: parameter-based, text-based, and visual-based. Our survey highlights the diverse capabilities and functional features of these tools, which cater to a wide range of users, from casual listeners to professional musicians. We observed that each tool has its own set of advantages and limitations, and we have compiled a comprehensive list of the factors that should be considered during tool selection. Moreover, our survey offers critical insights into the underlying mechanisms and challenges of AI music generation.

    Toward Interactive Music Generation: A Position Paper

    Music generation using deep learning has received considerable attention in recent years. Researchers have developed various generative models capable of imitating musical conventions, comprehending musical corpora, and generating new samples based on what they have learned. Although the samples generated by these models are persuasive, they often lack musical structure and creativity. For instance, a vanilla end-to-end approach, which deals with all levels of music representation at once, does not offer human-level control and interaction during the learning process, leading to constrained results. Indeed, music creation is a recurrent process in which a musician follows certain principles, reusing or adapting various musical features. Moreover, a musical piece adheres to a musical style, which breaks down into the precise concepts of timbre style, performance style, and composition style, along with the coherence between these aspects. Here, we study and analyze the current advances in music generation using deep learning models through different criteria. We discuss the shortcomings and limitations of these models regarding interactivity and adaptability. Finally, we outline potential future research directions, addressing multi-agent systems and reinforcement learning algorithms as ways to alleviate these shortcomings and limitations.

    Bass Accompaniment Generation via Latent Diffusion

    Automatically generating music that appropriately matches an arbitrary input track is a challenging task. We present a novel controllable system for generating single stems to accompany musical mixes of arbitrary length. At the core of our method are audio autoencoders that efficiently compress audio waveform samples into invertible latent representations, and a conditional latent diffusion model that takes as input the latent encoding of a mix and generates the latent encoding of a corresponding stem. To provide control over the timbre of generated samples, we introduce a technique to ground the latent space in a user-provided reference style during diffusion sampling. To further improve audio quality, we adapt classifier-free guidance to avoid distortions at high guidance strengths when generating in an unbounded latent space. We train our model on a dataset of pairs of mixes and matching bass stems. Quantitative experiments demonstrate that, given an input mix, the proposed system can generate basslines with user-specified timbres. Our controllable conditional audio generation framework represents a significant step forward in creating generative AI tools to assist musicians in music production.
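    Classifier-free guidance, which the paper adapts, combines a conditional and an unconditional denoiser prediction at each sampling step. The sketch below shows only the standard formulation; the `denoiser` callable and its signature are illustrative assumptions, and the paper's specific adaptation for unbounded latent spaces is not reproduced here.

```python
import numpy as np

def guided_prediction(denoiser, z_t, t, cond, guidance_scale):
    """Standard classifier-free guidance at one diffusion sampling step.

    denoiser(z_t, t, cond) -> predicted noise; cond=None means unconditional.
    guidance_scale > 1 sharpens conditioning but can distort outputs at high
    strengths, which is the failure mode the paper's adaptation targets.
    """
    eps_uncond = denoiser(z_t, t, None)
    eps_cond = denoiser(z_t, t, cond)
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)

# Toy stand-in denoiser so the sketch runs end to end.
def toy_denoiser(z_t, t, cond):
    shift = 0.0 if cond is None else cond.mean()
    return z_t * 0.1 + shift

z = np.zeros(16)
print(guided_prediction(toy_denoiser, z, t=10, cond=np.ones(16), guidance_scale=3.0))
```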

    Deep Learning Techniques for Music Generation -- A Survey

    This paper is a survey and an analysis of different ways of using deep learning (deep artificial neural networks) to generate musical content. We propose a methodology based on five dimensions for our analysis:
    - Objective: What musical content is to be generated (e.g., melody, polyphony, accompaniment, or counterpoint)? For what destination and use: to be performed by humans (a musical score) or by a machine (an audio file)?
    - Representation: What concepts are to be manipulated (e.g., waveform, spectrogram, note, chord, meter, beat)? What format is used (e.g., MIDI, piano roll, or text)? How is the representation encoded (e.g., scalar, one-hot, or many-hot)?
    - Architecture: What type(s) of deep neural network are used (e.g., feedforward network, recurrent network, autoencoder, or generative adversarial network)?
    - Challenge: What are the limitations and open challenges (e.g., variability, interactivity, creativity)?
    - Strategy: How do we model and control the generation process (e.g., single-step feedforward, iterative feedforward, sampling, or input manipulation)?
    For each dimension, we conduct a comparative analysis of various models and techniques, and we propose a tentative multidimensional typology. This typology is bottom-up, based on the analysis of many existing deep-learning-based systems for music generation selected from the relevant literature. These systems are described and used to exemplify the various choices of objective, representation, architecture, challenge, and strategy. The last section includes some discussion and prospects.
    Comment: 209 pages. This paper is a simplified version of the book: J.-P. Briot, G. Hadjeres and F.-D. Pachet, Deep Learning Techniques for Music Generation, Computational Synthesis and Creative Systems, Springer, 2019.
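    As a concrete instance of the survey's Representation dimension, the sketch below encodes a short monophonic melody as a one-hot piano roll; the pitch range, melody, and one-step-per-note timing are arbitrary choices for illustration.

```python
import numpy as np

# MIDI pitches of a short monophonic melody, one note per time step.
melody = [60, 62, 64, 65, 67]          # C4 D4 E4 F4 G4
low, high = 48, 84                     # pitch range covered by the roll

# One-hot piano roll: rows are time steps, columns are pitches.
roll = np.zeros((len(melody), high - low), dtype=np.int8)
for step, pitch in enumerate(melody):
    roll[step, pitch - low] = 1        # exactly one active pitch per step

print(roll.shape)        # (5, 36)
print(roll.sum(axis=1))  # [1 1 1 1 1] -> one-hot at every time step
```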

    A Survey of Music Generation in the Context of Interaction

    In recent years, machine learning, and in particular generative adversarial networks (GANs) and attention-based neural networks (transformers), has been successfully used to compose and generate music, both melodies and polyphonic pieces. Current research focuses foremost on style replication (e.g., generating a Bach-style chorale) or style transfer (e.g., classical to jazz) based on large amounts of recorded or transcribed music, which in turn allows for fairly straightforward "performance" evaluation. However, most of these models are not suitable for human-machine co-creation through live interaction, nor is it clear how such models and the resulting creations would be evaluated. This article presents a thorough review of music representation, feature analysis, heuristic algorithms, statistical and parametric modelling, and human and automatic evaluation measures, along with a discussion of which approaches and models seem most suitable for live interaction.

    Sequential decision modeling in uncertain conditions

    This thesis presents a sequence of approaches to structured decision modeling, that is, generative solutions to tasks with multiple inputs and outputs featuring complicated interactions between input and output elements. Crucially, these problems also involve a high degree of uncertainty about the correct outcome: many largely equivalent but structurally different outcomes can be considered equally correct. The thesis presents four articles on these topics, focusing on the domains of text-to-speech synthesis, symbolic music generation, text processing, automatic speech recognition, and speech-text representation learning. Each article presents a particular approach to solving problems in its respective domain, centered on proposing and understanding deep learning architectures for that domain. The deep learning techniques used in these articles are broadly applicable, flexible, and powerful enough that they may find application in other areas; however, we remain focused on the domains discussed in each respective article. The first article presents an approach allowing flexible phonetic and character control of a text-to-speech system, utilizing an efficient "swap-out" method for blending representations at the word level. This blending allows smooth control over input conditions and also strengthens character-only pronunciations, resulting in a preference for the blended text-to-speech system in A/B testing over equivalent baselines, even when using the same input modalities. The second article focuses on another text-to-speech system, this time centered on building multi-scale representations of speech audio using a structured decomposition of audio features. In particular, this work focuses on a compute-efficient methodology, while building on prior work that requires a much greater computational budget than the proposed system. To perform text-to-speech synthesis effectively under these computational constraints, a number of new components are constructed and integrated, resulting in an efficient model for text-to-speech synthesis. The third article presents a new non-autoregressive model for modeling symbolic sequences. This model uses iterative prediction and re-estimation to build structured outputs that respect numerous constraints in the underlying sequence domain. It is applied to symbolic music modeling and text modeling, showing excellent performance, particularly in limited-data generative settings. The final article in this thesis focuses on understanding the speech-text representations learned by a text-injected speech recognition system from prior literature.
    Through a systematic series of studies utilizing pre-trained text and duration models, qualitative relations between text and speech sequences, and performance studies in few-shot cross-modal retrieval, we reveal a number of crucial properties underlying the performance of this system, paving the way for future algorithmic development. In addition, model variants built during this study achieve impressive results on a number of benchmark tasks using partially frozen and transferred parameters.
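    The iterative prediction and re-estimation loop described for the third article follows the general shape of masked iterative refinement for non-autoregressive generation. The sketch below shows that generic pattern with a toy stand-in predictor; the masking schedule, confidence rule, and predictor are illustrative assumptions, not the thesis's actual algorithm.

```python
import numpy as np

def iterative_refinement(predict, seq_len, n_steps=4):
    """Generic masked iterative decoding for non-autoregressive generation.

    predict(tokens, mask) -> (seq_len, vocab_size) per-position probabilities.
    Start fully masked; each round, re-estimate every position, then keep a
    growing fraction of the most confident tokens and re-mask the rest.
    """
    tokens = np.full(seq_len, -1)                  # -1 marks a masked slot
    mask = np.ones(seq_len, dtype=bool)
    for step in range(n_steps):
        probs = predict(tokens, mask)
        tokens = probs.argmax(axis=-1)             # re-estimate everything
        conf = probs.max(axis=-1)
        n_keep = seq_len * (step + 1) // n_steps
        keep = np.argsort(conf)[::-1][:n_keep]     # most confident positions
        mask = np.ones(seq_len, dtype=bool)
        mask[keep] = False
        tokens = np.where(mask, -1, tokens)        # re-mask the rest
    return tokens

# Toy predictor with fixed random probabilities, standing in for a trained net.
rng = np.random.default_rng(1)
table = rng.random((16, 8))
table /= table.sum(axis=-1, keepdims=True)
print(iterative_refinement(lambda tokens, mask: table, seq_len=16))
```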
    • 

    corecore