70 research outputs found

    ์กฐ๊ฑด๋ถ€ ์ž๊ธฐํšŒ๊ท€ํ˜• ์ธ๊ณต์‹ ๊ฒฝ๋ง์„ ์ด์šฉํ•œ ์ œ์–ด ๊ฐ€๋Šฅํ•œ ๊ฐ€์ฐฝ ์Œ์„ฑ ํ•ฉ์„ฑ

    ํ•™์œ„๋…ผ๋ฌธ(๋ฐ•์‚ฌ) -- ์„œ์šธ๋Œ€ํ•™๊ต๋Œ€ํ•™์› : ์œตํ•ฉ๊ณผํ•™๊ธฐ์ˆ ๋Œ€ํ•™์› ์ง€๋Šฅ์ •๋ณด์œตํ•ฉํ•™๊ณผ, 2022. 8. ์ด๊ต๊ตฌ.Singing voice synthesis aims at synthesizing a natural singing voice from given input information. A successful singing synthesis system is important not only because it can significantly reduce the cost of the music production process, but also because it helps to more easily and conveniently reflect the creator's intentions. However, there are three challenging problems in designing such a system - 1) It should be possible to independently control the various elements that make up the singing. 2) It must be possible to generate high-quality sound sources, 3) It is difficult to secure sufficient training data. To deal with this problem, we first paid attention to the source-filter theory, which is a representative speech production modeling technique. We tried to secure training data efficiency and controllability at the same time by modeling a singing voice as a convolution of the source, which is pitch information, and filter, which is the pronunciation information, and designing a structure that can model each independently. In addition, we used a conditional autoregressive model-based deep neural network to effectively model sequential data in a situation where conditional inputs such as pronunciation, pitch, and speaker are given. In order for the entire framework to generate a high-quality sound source with a distribution more similar to that of a real singing voice, the adversarial training technique was applied to the training process. Finally, we applied a self-supervised style modeling technique to model detailed unlabeled musical expressions. We confirmed that the proposed model can flexibly control various elements such as pronunciation, pitch, timbre, singing style, and musical expression, while synthesizing high-quality singing that is difficult to distinguish from ground truth singing. Furthermore, we proposed a generation and modification framework that considers the situation applied to the actual music production process, and confirmed that it is possible to apply it to expand the limits of the creator's imagination, such as new voice design and cross-generation.๊ฐ€์ฐฝ ํ•ฉ์„ฑ์€ ์ฃผ์–ด์ง„ ์ž…๋ ฅ ์•…๋ณด๋กœ๋ถ€ํ„ฐ ์ž์—ฐ์Šค๋Ÿฌ์šด ๊ฐ€์ฐฝ ์Œ์„ฑ์„ ํ•ฉ์„ฑํ•ด๋‚ด๋Š” ๊ฒƒ์„ ๋ชฉํ‘œ๋กœ ํ•œ๋‹ค. ๊ฐ€์ฐฝ ํ•ฉ์„ฑ ์‹œ์Šคํ…œ์€ ์Œ์•… ์ œ์ž‘ ๋น„์šฉ์„ ํฌ๊ฒŒ ์ค„์ผ ์ˆ˜ ์žˆ์„ ๋ฟ๋งŒ ์•„๋‹ˆ๋ผ ์ฐฝ์ž‘์ž์˜ ์˜๋„๋ฅผ ๋ณด๋‹ค ์‰ฝ๊ณ  ํŽธ๋ฆฌํ•˜๊ฒŒ ๋ฐ˜์˜ํ•  ์ˆ˜ ์žˆ๋„๋ก ๋•๋Š”๋‹ค. ํ•˜์ง€๋งŒ ์ด๋Ÿฌํ•œ ์‹œ์Šคํ…œ์˜ ์„ค๊ณ„๋ฅผ ์œ„ํ•ด์„œ๋Š” ๋‹ค์Œ ์„ธ ๊ฐ€์ง€์˜ ๋„์ „์ ์ธ ์š”๊ตฌ์‚ฌํ•ญ์ด ์กด์žฌํ•œ๋‹ค. 1) ๊ฐ€์ฐฝ์„ ์ด๋ฃจ๋Š” ๋‹ค์–‘ํ•œ ์š”์†Œ๋ฅผ ๋…๋ฆฝ์ ์œผ๋กœ ์ œ์–ดํ•  ์ˆ˜ ์žˆ์–ด์•ผ ํ•œ๋‹ค. 2) ๋†’์€ ํ’ˆ์งˆ ์ˆ˜์ค€ ๋ฐ ์‚ฌ์šฉ์„ฑ์„ ๋‹ฌ์„ฑํ•ด์•ผ ํ•œ๋‹ค. 3) ์ถฉ๋ถ„ํ•œ ํ›ˆ๋ จ ๋ฐ์ดํ„ฐ๋ฅผ ํ™•๋ณดํ•˜๊ธฐ ์–ด๋ ต๋‹ค. ์ด๋Ÿฌํ•œ ๋ฌธ์ œ์— ๋Œ€์‘ํ•˜๊ธฐ ์œ„ํ•ด ์šฐ๋ฆฌ๋Š” ๋Œ€ํ‘œ์ ์ธ ์Œ์„ฑ ์ƒ์„ฑ ๋ชจ๋ธ๋ง ๊ธฐ๋ฒ•์ธ ์†Œ์Šค-ํ•„ํ„ฐ ์ด๋ก ์— ์ฃผ๋ชฉํ•˜์˜€๋‹ค. ๊ฐ€์ฐฝ ์‹ ํ˜ธ๋ฅผ ์Œ์ • ์ •๋ณด์— ํ•ด๋‹นํ•˜๋Š” ์†Œ์Šค์™€ ๋ฐœ์Œ ์ •๋ณด์— ํ•ด๋‹นํ•˜๋Š” ํ•„ํ„ฐ์˜ ํ•ฉ์„ฑ๊ณฑ์œผ๋กœ ์ •์˜ํ•˜๊ณ , ์ด๋ฅผ ๊ฐ๊ฐ ๋…๋ฆฝ์ ์œผ๋กœ ๋ชจ๋ธ๋งํ•  ์ˆ˜ ์žˆ๋Š” ๊ตฌ์กฐ๋ฅผ ์„ค๊ณ„ํ•˜์—ฌ ํ›ˆ๋ จ ๋ฐ์ดํ„ฐ ํšจ์œจ์„ฑ๊ณผ ์ œ์–ด ๊ฐ€๋Šฅ์„ฑ์„ ๋™์‹œ์— ํ™•๋ณดํ•˜๊ณ ์ž ํ•˜์˜€๋‹ค. ๋˜ํ•œ ์šฐ๋ฆฌ๋Š” ๋ฐœ์Œ, ์Œ์ •, ํ™”์ž ๋“ฑ ์กฐ๊ฑด๋ถ€ ์ž…๋ ฅ์ด ์ฃผ์–ด์ง„ ์ƒํ™ฉ์—์„œ ์‹œ๊ณ„์—ด ๋ฐ์ดํ„ฐ๋ฅผ ํšจ๊ณผ์ ์œผ๋กœ ๋ชจ๋ธ๋งํ•˜๊ธฐ ์œ„ํ•˜์—ฌ ์กฐ๊ฑด๋ถ€ ์ž๊ธฐํšŒ๊ท€ ๋ชจ๋ธ ๊ธฐ๋ฐ˜์˜ ์‹ฌ์ธต์‹ ๊ฒฝ๋ง์„ ํ™œ์šฉํ•˜์˜€๋‹ค. 
๋งˆ์ง€๋ง‰์œผ๋กœ ๋ ˆ์ด๋ธ”๋ง ๋˜์–ด์žˆ์ง€ ์•Š์€ ์Œ์•…์  ํ‘œํ˜„์„ ๋ชจ๋ธ๋งํ•  ์ˆ˜ ์žˆ๋„๋ก ์šฐ๋ฆฌ๋Š” ์ž๊ธฐ์ง€๋„ํ•™์Šต ๊ธฐ๋ฐ˜์˜ ์Šคํƒ€์ผ ๋ชจ๋ธ๋ง ๊ธฐ๋ฒ•์„ ์ œ์•ˆํ–ˆ๋‹ค. ์šฐ๋ฆฌ๋Š” ์ œ์•ˆํ•œ ๋ชจ๋ธ์ด ๋ฐœ์Œ, ์Œ์ •, ์Œ์ƒ‰, ์ฐฝ๋ฒ•, ํ‘œํ˜„ ๋“ฑ ๋‹ค์–‘ํ•œ ์š”์†Œ๋ฅผ ์œ ์—ฐํ•˜๊ฒŒ ์ œ์–ดํ•˜๋ฉด์„œ๋„ ์‹ค์ œ ๊ฐ€์ฐฝ๊ณผ ๊ตฌ๋ถ„์ด ์–ด๋ ค์šด ์ˆ˜์ค€์˜ ๊ณ ํ’ˆ์งˆ ๊ฐ€์ฐฝ ํ•ฉ์„ฑ์ด ๊ฐ€๋Šฅํ•จ์„ ํ™•์ธํ–ˆ๋‹ค. ๋‚˜์•„๊ฐ€ ์‹ค์ œ ์Œ์•… ์ œ์ž‘ ๊ณผ์ •์„ ๊ณ ๋ คํ•œ ์ƒ์„ฑ ๋ฐ ์ˆ˜์ • ํ”„๋ ˆ์ž„์›Œํฌ๋ฅผ ์ œ์•ˆํ•˜์˜€๊ณ , ์ƒˆ๋กœ์šด ๋ชฉ์†Œ๋ฆฌ ๋””์ž์ธ, ๊ต์ฐจ ์ƒ์„ฑ ๋“ฑ ์ฐฝ์ž‘์ž์˜ ์ƒ์ƒ๋ ฅ๊ณผ ํ•œ๊ณ„๋ฅผ ๋„“ํž ์ˆ˜ ์žˆ๋Š” ์‘์šฉ์ด ๊ฐ€๋Šฅํ•จ์„ ํ™•์ธํ–ˆ๋‹ค.1 Introduction 1 1.1 Motivation 1 1.2 Problems in singing voice synthesis 4 1.3 Task of interest 8 1.3.1 Single-singer SVS 9 1.3.2 Multi-singer SVS 10 1.3.3 Expressive SVS 11 1.4 Contribution 11 2 Background 13 2.1 Singing voice 14 2.2 Source-filter theory 18 2.3 Autoregressive model 21 2.4 Related works 22 2.4.1 Speech synthesis 25 2.4.2 Singing voice synthesis 29 3 Adversarially Trained End-to-end Korean Singing Voice Synthesis System 31 3.1 Introduction 31 3.2 Related work 33 3.3 Proposed method 35 3.3.1 Input representation 35 3.3.2 Mel-synthesis network 36 3.3.3 Super-resolution network 38 3.4 Experiments 42 3.4.1 Dataset 42 3.4.2 Training 42 3.4.3 Evaluation 43 3.4.4 Analysis on generated spectrogram 46 3.5 Discussion 49 3.5.1 Limitations of input representation 49 3.5.2 Advantages of using super-resolution network 53 3.6 Conclusion 55 4 Disentangling Timbre and Singing Style with multi-singer Singing Synthesis System 57 4.1Introduction 57 4.2 Related works 59 4.2.1 Multi-singer SVS system 60 4.3 Proposed Method 60 4.3.1 Singer identity encoder 62 4.3.2 Disentangling timbre & singing style 64 4.4 Experiment 64 4.4.1 Dataset and preprocessing 64 4.4.2 Training & inference 65 4.4.3 Analysis on generated spectrogram 65 4.4.4 Listening test 66 4.4.5 Timbre & style classification test 68 4.5 Discussion 70 4.5.1 Query audio selection strategy for singer identity encoder 70 4.5.2 Few-shot adaptation 72 4.6 Conclusion 74 5 Expressive Singing Synthesis Using Local Style Token and Dual-path Pitch Encoder 77 5.1 Introduction 77 5.2 Related work 79 5.3 Proposed method 80 5.3.1 Local style token module 80 5.3.2 Dual-path pitch encoder 85 5.3.3 Bandwidth extension vocoder 85 5.4 Experiment 86 5.4.1 Dataset 86 5.4.2 Training 86 5.4.3 Qualitative evaluation 87 5.4.4 Dual-path reconstruction analysis 89 5.4.5 Qualitative analysis 90 5.5 Discussion 93 5.5.1 Difference between midi pitch and f0 93 5.5.2 Considerations for use in the actual music production process 94 5.6 Conclusion 95 6 Conclusion 97 6.1 Thesis summary 97 6.2 Limitations and future work 99 6.2.1 Improvements to a faster and robust system 99 6.2.2 Explainable and intuitive controllability 101 6.2.3 Extensions to common speech synthesis tools 103 6.2.4 Towards a collaborative and creative tool 104๋ฐ•

    ใƒใƒงใ‚ฆใ‚ซใ‚ฏ ใ‚ฒใ‚คใ‚ธใƒฅใƒ„ ใƒ˜ใƒŽ ใ‚ธใƒงใ‚ฆใƒ›ใ‚ฆใ‚ฌใ‚ฏใƒ†ใ‚ญ ใ‚ขใƒ—ใƒญใƒผใƒ ใƒˆ ใ‚ชใƒณใ‚ฌใ‚ฏ ใ‚ธใƒงใ‚ฆใƒ›ใ‚ฆ ใ‚ทใƒงใƒช ใƒ„ใƒผใƒซ ใƒŽ ใ‚ซใ‚คใƒใƒ„ ใ‚ธใƒฌใ‚ค

    ใƒ’ใƒˆ่ด่ฆšใซใฏ่ธ็‰›ใจใ„ใ†่žบๆ—‹ๅฝขใ‚’ใ—ใŸ้Ÿณ้šŽ็†่ซ–็š„ใซๅ‘จๆณขๆ•ฐๅˆ†ๆžใ‚’่กŒใฃใฆใ„ใ‚‹ๅ™จๅฎ˜ใŒใ‚ใ‚‹ใ€‚ใƒ’ใƒˆใฎไผš่ฉฑใ‚’ใฏใ˜ใ‚ใ€ๅŒใ˜่ด่ฆšๅ™จๅฎ˜ใ‚’็”จใ„ใฆ้Ÿณ้šŽ็†่ซ–็š„ใช่ช่ญ˜ใ‚’่กŒใฃใฆใ„ใ‚‹ไปฅไธŠใ€ไปปๆ„ใฎ้Ÿณใ‚’ไบ”็ทš่ญœใง่กจ็พใ™ใ‚‹ใ“ใจใฏไธๅฏ่ƒฝใงใฏใชใ„ใ€‚ใใ“ใง็ญ†่€…ใ‚‰ใฏใ€ใƒ’ใƒˆใฎ้Ÿณๅฃฐใ‚’ๅซใ‚€ใ€ใ‚ใ‚‰ใ‚†ใ‚‹้Ÿณใ‚’ไบ”็ทš่ญœใซ่ผ‰ใ›ใ‚‹ใ“ใจใ‚’ๅฏ่ƒฝใซใ™ใ‚‹ใ€ไธ€่ˆฌ้Ÿณ้ŸฟไฟกๅทใฎMIDI ็ฌฆๅทๅŒ–ใƒ„ใƒผใƒซใฎ้–‹็™บใซ็€ๆ‰‹ใ—ใŸใ€‚We human beings have auditory organs called as a cochlear, which make frequency analysis on themusical scale. As we recognize any kinds of acoustic signals including human speech sounds with thesame auditory organs, it seems not impossible to transcript them in a staff notation. Therefore, we havebegun developing a MIDI encoding tool for general acoustic signals, which can transcript any kinds ofgiven acoustic signals including human speech sounds in a staff notation.After that, we have tried to develop a lossless compression technique for audio signals; an automaticendless background music composition technique by synthesizing selected rhythm, chord and melodyphrase parts; an audio fingerprint technique for identifying music works; and an inaudible sign informationembedding technique for music works.In this paper, we will review these development examples of music informatics techniques done byus for past 15 years, and discuss future research topics

    Generative models for music using transformer architectures

    This thesis focuses on the growth and impact of Transformer architectures, originally developed for Natural Language Processing tasks, when applied to audio generation. We think that music, with its notes, chords, and volumes, is a language, and its symbolic representation can be treated like human language. A brief history of sound synthesis, which gives the basic foundation for modern AI-generated music models, is included. The most recent work in AI-generated audio is carefully studied, and instances of AI-generated music are described in many contexts. Deep learning models and their applications to real-world issues are among the key subjects covered. The main areas of interest include transformer-based audio generation, including the training procedure, encoding and decoding techniques, and post-processing stages. Transformers have several key advantages, including long-term consistency and the ability to create minute-long audio compositions. Numerous studies on the various representations of music are discussed, including how neural network and deep learning techniques can be applied to symbolic melodies, musical arrangements, style transfer, and sound production. This thesis largely focuses on transformer models, but it also recognises the importance of other AI-based generative models, including GANs. Overall, this thesis enhances generative models for music composition and provides a thorough understanding of transformer design, showing the possibilities of AI-generated sound synthesis by emphasising the most recent developments.
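
    The thesis centres on autoregressive transformer generation of music represented as token sequences. Purely as a hedged sketch of that setup, and not the model described in the thesis, the following decoder-only transformer over symbolic note-event tokens uses placeholder vocabulary size and hyperparameters.

# Decoder-only transformer language model over symbolic music tokens
# (note events, time shifts, velocities, etc.); all sizes are placeholders.
import torch
import torch.nn as nn

VOCAB, D_MODEL, MAX_LEN = 512, 256, 1024     # assumed event vocabulary and model size

class MusicTransformerLM(nn.Module):
    def __init__(self):
        super().__init__()
        self.tok = nn.Embedding(VOCAB, D_MODEL)
        self.pos = nn.Embedding(MAX_LEN, D_MODEL)
        layer = nn.TransformerEncoderLayer(
            d_model=D_MODEL, nhead=8, dim_feedforward=1024, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=6)
        self.head = nn.Linear(D_MODEL, VOCAB)

    def forward(self, tokens):                       # tokens: (batch, seq)
        seq = tokens.size(1)
        pos = torch.arange(seq, device=tokens.device)
        x = self.tok(tokens) + self.pos(pos)
        causal = nn.Transformer.generate_square_subsequent_mask(seq).to(tokens.device)
        x = self.blocks(x, mask=causal)              # causal mask keeps it autoregressive
        return self.head(x)                          # next-token logits

# Sampling: generate one event at a time, feeding the sequence back in.
model = MusicTransformerLM()
seq = torch.zeros(1, 1, dtype=torch.long)            # start token (assumed id 0)
for _ in range(32):
    probs = torch.softmax(model(seq)[:, -1], dim=-1)
    seq = torch.cat([seq, torch.multinomial(probs, 1)], dim=1)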

    Danalog: Digital Music Synthesizer

    The Danalog is a 25-key portable digital music synthesizer that uses multiple synthesis methods and effects to generate sounds. The design called for three synthesis methods (FM, subtractive, and sample-based) with up to eight adjustable parameters; at least four effects, including reverb, chorus, and flange, with five adjustable parameters; at least two-note polyphony; and a five-band equalizer. The user would adjust these settings using digital encoders and potentiometers and view them on two LCD screens. The final project did not meet all of the original design requirements: in the end product only the FM synthesis method was fully working, the synthesizer produced two-note polyphony, and the LCD screens displayed information about the synthesis method as the user played.
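
    Since FM was the synthesis method that worked in the final build, here is a hedged two-operator FM sketch, y(t) = A*sin(2*pi*fc*t + I*sin(2*pi*fm*t)), with a simple two-note mix to mirror the two-note polyphony; it is illustrative only, not the Danalog firmware, and the parameter values are assumptions.

# Two-operator FM synthesis sketch (illustrative; not the Danalog firmware).
import numpy as np

SR = 44100   # sample rate (assumed)

def fm_note(fc_hz, ratio=2.0, index=3.0, dur_s=1.0, amp=0.8):
    """Carrier at fc_hz, modulator at fc_hz * ratio, modulation index `index`."""
    t = np.arange(int(SR * dur_s)) / SR
    modulator = np.sin(2 * np.pi * fc_hz * ratio * t)
    tone = amp * np.sin(2 * np.pi * fc_hz * t + index * modulator)
    release = np.minimum(1.0, 10.0 * (dur_s - t))    # simple linear fade-out
    return tone * release

# Two-note polyphony, as in the final build: mix two voices and keep headroom.
chord = 0.5 * (fm_note(261.63) + fm_note(329.63))    # C4 + E4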

    Pathway to Future Symbiotic Creativity

    This report presents a comprehensive view of our vision for the development path of human-machine symbiotic art creation. We propose a classification of creative systems into a hierarchy of five classes, showing the pathway of creativity evolving from mimic-human artists (Turing Artists) to a machine artist in its own right. We begin with an overview of the limitations of Turing Artists, then focus on the top two classes, Machine Artists, emphasizing machine-human communication in art creation. In art creation, machines need to understand humans' mental states, including desires, appreciation, and emotions, and humans in turn need to understand machines' creative capabilities and limitations. The rapid development of immersive environments and their further evolution into the new concept of the metaverse enable symbiotic art creation through unprecedented flexibility of bi-directional communication between artists and art manifestation environments. By examining the latest sensor and XR technologies, we illustrate a novel way of collecting art data that forms the basis of a new form of human-machine bidirectional communication and understanding in art creation. Based on such communication and understanding mechanisms, we propose a novel framework for building future Machine Artists, built on the philosophy that a human-compatible AI system should follow the "human-in-the-loop" principle rather than the traditional "end-to-end" dogma. By proposing a new form of inverse reinforcement learning model, we outline the platform design of machine artists, demonstrate its functions, and showcase some examples of technologies we have developed. We also provide a systematic exposition of the ecosystem for an AI-based symbiotic art form and community, with an economic model built on NFT technology. Ethical issues for the development of machine artists are also discussed.

    Live interactive music performance through the Internet

    Thesis (M.S.) -- Massachusetts Institute of Technology, Program in Media Arts & Sciences, 1996. Includes bibliographical references (p. 96-97). By Charles Wei-Ting Tang.

    Expressive Musical Robots: Building, Evaluating, and Interfacing with an Ensemble of Mechatronic Instruments

    An increase in the number of parameters of expression on musical robots can result in an increase in their expressivity as musical instruments. This thesis focuses on the design, construction, and implementation of four new robotic instruments, each designed to add more parametric control than is typical for the current state of the art of musical robotics. The principles followed in the building of the four new instruments are scalable and can be applied to musical robotics in general: the techniques exhibited in this thesis for the construction and use of musical robotics can be used by composers, musicians, and installation artists to add expressive depth to their own works with robotic instruments. Accompanying the increase in parametric depth applied to the musical robotics is an increase in difficulty in interfacing with them: robots with a greater number of actuators require more time to program. This document aims to address this problem in two ways: the use of closed-loop control for low-level adjustments of the robots and the use of a parametric encoding-equipped musical robot network to provide composers with intuitive musical commands for the robots. The musical robots introduced, described, and applied in this thesis were conceived of as musical instruments for performance and installation use by artists. This thesis closes with an exhibition of the performance and installation uses of these new robots and with a discussion of future research directions
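
    The thesis mentions closed-loop control for low-level adjustment of the robots' actuators. As a generic, hedged illustration of that idea only (the thesis's controllers and hardware are not reproduced here), a basic PID loop drives an actuator toward a target position using sensor feedback.

# Generic PID position loop of the kind used for closed-loop actuator
# adjustment (illustrative; gains, rates, and plant are made up).
class PID:
    def __init__(self, kp, ki, kd, dt):
        self.kp, self.ki, self.kd, self.dt = kp, ki, kd, dt
        self.integral = 0.0
        self.prev_error = 0.0

    def update(self, setpoint, measurement):
        error = setpoint - measurement
        self.integral += error * self.dt
        derivative = (error - self.prev_error) / self.dt
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative

pid = PID(kp=2.0, ki=0.5, kd=0.05, dt=0.001)
position = 0.0
for _ in range(2000):                       # 2 s at a 1 kHz control rate
    command = pid.update(setpoint=1.0, measurement=position)
    position += command * 0.001             # toy plant: the command drives velocity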

    Deep Visual Instruments: Realtime Continuous, Meaningful Human Control over Deep Neural Networks for Creative Expression

    In this thesis, we investigate Deep Learning models as an artistic medium for new modes of performative, creative expression. We call these Deep Visual Instruments: realtime interactive generative systems that exploit and leverage the capabilities of state-of-the-art Deep Neural Networks (DNN), while allowing Meaningful Human Control, in a Realtime Continuous manner. We characterise Meaningful Human Control in terms of intent, predictability, and accountability; and Realtime Continuous Control with regards to its capacity for performative interaction with immediate feedback, enhancing goal-less exploration. The capabilities of DNNs that we are looking to exploit and leverage in this manner, are their ability to learn hierarchical representations modelling highly complex, real-world data such as images. Thinking of DNNs as tools that extract useful information from massive amounts of Big Data, we investigate ways in which we can navigate and explore what useful information a DNN has learnt, and how we can meaningfully use such a model in the production of artistic and creative works, in a performative, expressive manner. We present five studies that approach this from different but complementary angles. These include: a collaborative, generative sketching application using MCTS and discriminative CNNs; a system to gesturally conduct the realtime generation of text in different styles using an ensemble of LSTM RNNs; a performative tool that allows for the manipulation of hyperparameters in realtime while a Convolutional VAE trains on a live camera feed; a live video feed processing software that allows for digital puppetry and augmented drawing; and a method that allows for long-form story telling within a generative model's latent space with meaningful control over the narrative. We frame our research with the realtime, performative expression provided by musical instruments as a metaphor, in which we think of these systems as not used by a user, but played by a performer
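
    A recurring theme in the thesis is realtime, continuous, meaningful control over what a trained generative model can produce by navigating its latent space. As a hedged sketch of that general idea (not any of the five systems described), a single performer controller value is mapped to smooth spherical interpolation between chosen latent anchor points; the dimensionality and anchors are assumptions.

# Map a 1-D performer control (fader, gesture) to smooth movement between
# anchor points in a generative model's latent space (illustrative only).
import numpy as np

def slerp(z0, z1, t):
    """Spherical interpolation, commonly used for latent-space traversal."""
    a, b = z0 / np.linalg.norm(z0), z1 / np.linalg.norm(z1)
    omega = np.arccos(np.clip(np.dot(a, b), -1.0, 1.0))
    if omega < 1e-6:
        return z0
    return (np.sin((1 - t) * omega) * z0 + np.sin(t * omega) * z1) / np.sin(omega)

rng = np.random.default_rng(0)
anchors = [rng.standard_normal(128) for _ in range(4)]   # latent "poses" picked by the performer

def latent_for_control(value):
    """value in [0, 1] sweeps continuously through the anchor sequence."""
    scaled = value * (len(anchors) - 1)
    i = min(int(scaled), len(anchors) - 2)
    return slerp(anchors[i], anchors[i + 1], scaled - i)

z = latent_for_control(0.42)   # feed z to the generator (VAE/GAN decoder) every frame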

    The augmented tonoscope: towards a deeper understanding of the interplay between sound and image in visual music

    This thesis presents the theoretical, technical and aesthetic concerns in realising a harmonic complementarity and more intimate perceptual connection between music and moving image. It explores the inspirations and various processes involved in creating a series of artistic works - attached as a portfolio and produced as the research. This includes the Cymatic Adufe (v1.1) - a sound-responsive, audiovisual installation; Stravinsky Rose (v2.0) - an audiovisual short in Dome format; and the live performance works of Whitney Triptych (v1.2), Moirรฉ Modes (v1.1) and Stravinsky Rose (v3.0). The thesis outlines an approach towards realising a deeper understanding of the interplay between sound and image in Visual Music - through applying: the Differential Dynamics of pioneering, computer-aided, experimental animator John Whitney Sr.; alternate musical tunings based on harmonic consonance and the Pythagorean laws of harmony; and soundโ€™s ability to induce physical form and flow via Cymatics - the study of wave phenomena and vibration - a term coined by Dr. Hans Jenny for his seminal research into these effects in the 1960s and 70s, using a device of his own design - the สปtonoscopeสผ. The thesis discusses the key method for this artistic investigation through the design, fabrication and crafting of a hybrid analogue/digital audiovisual instrument - a contemporary version of Jennyโ€™s sound visualisation tool - The Augmented Tonoscope. It details the developmental process which has realised a modular performance system integrating sound making, sound analysis, analogue outputs, virtual systems, musical interface and recording and sequencing. Finally, the thesis details the impact of this system on creating audiovisualisation of a distinct quality through: a formalist, minimal, decluttered aesthetic; a direct, elemental and real-time correspondence between sound and image; a mirroring of musicโ€™s innate movement and transition within the visual domain; and an underlying concord or harmony between music and moving image
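
    Whitney's Differential Dynamics, which the thesis applies, animates elements at angular rates in whole-number ratio so they periodically drift out of and back into alignment, a visual analogue of harmonic consonance. The sketch below is only an illustration of that principle, not the Augmented Tonoscope's software; the element count and base rate are assumptions.

# Differential-dynamics sketch after John Whitney Sr.: N elements rotate at
# integer multiples of a fundamental rate, dispersing and realigning in cycles.
import math

N = 48                       # number of elements (assumed)
FUNDAMENTAL = 0.05           # base angular velocity in radians per frame (assumed)

def frame_positions(t):
    """(x, y) of each element at frame t; element k moves at the k-th harmonic rate."""
    points = []
    for k in range(1, N + 1):
        angle = k * FUNDAMENTAL * t
        radius = k / N                        # spread elements from centre to rim
        points.append((radius * math.cos(angle), radius * math.sin(angle)))
    return points

# The figure resolves every 2*pi / FUNDAMENTAL frames, when all harmonics realign.
snapshot = frame_positions(200)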