Pre-training Music Classification Models via Music Source Separation
In this paper, we study whether music source separation can be used as a
pre-training strategy for music representation learning, targeted at music
classification tasks. To this end, we first pre-train U-Net networks under
various music source separation objectives, such as the isolation of vocal or
instrumental sources from a musical piece; afterwards, we attach a
convolutional tail network to the pre-trained U-Net and jointly finetune the
whole network. The features learned by the separation network are also
propagated to the tail network through skip connections. Experimental results
in two widely used and publicly available datasets indicate that pre-training
the U-Nets with a music source separation objective can improve performance
compared to both training the whole network from scratch and using the tail
network as a standalone in two music classification tasks: music auto-tagging,
when vocal separation is used, and music genre classification for the case of
multi-source separation.
Comment: 5 pages (4+references), 3 figures. ICASSP-24 submission.
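The abstract leaves the architecture details open; as a rough illustration of the pattern it describes (assuming PyTorch, with layer sizes and names invented for this sketch rather than taken from the paper), a classification tail can be attached to a pre-trained separation U-Net and the whole stack fine-tuned jointly:

```python
import torch
import torch.nn as nn

class SeparationPretrainedClassifier(nn.Module):
    """Sketch: a small convolutional tail on top of a U-Net that was
    pre-trained on a source-separation objective. The paper also routes
    U-Net skip-connection features into the tail; that detail is omitted
    here for brevity."""

    def __init__(self, pretrained_unet: nn.Module, n_classes: int):
        super().__init__()
        self.unet = pretrained_unet                 # pre-trained separator
        self.tail = nn.Sequential(                  # illustrative tail network
            nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(32, n_classes),
        )

    def forward(self, spectrogram: torch.Tensor) -> torch.Tensor:
        separated = self.unet(spectrogram)          # e.g. estimated vocal source
        return self.tail(separated)                 # tag or genre logits
```

During fine-tuning both the U-Net and the tail receive gradients, matching the joint fine-tuning described above; freezing `self.unet` would instead treat the separator as a fixed feature extractor.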
Automated Composition of Picture-Synched Music Soundtracks for Movies
We describe the implementation of and early results from a system that
automatically composes picture-synched musical soundtracks for videos and
movies. We use the phrase "picture-synched" to mean that the structure of the
automatically composed music is determined by visual events in the input movie,
i.e. the final music is synchronised to visual events and features such as cut
transitions or within-shot key-frame events. Our system combines automated
video analysis and computer-generated music-composition techniques to create
unique soundtracks in response to the video input, and can be thought of as an
initial step in creating a computerised replacement for a human composer
writing music to fit the picture-locked edit of a movie. Working only from the
video information in the movie, key features are extracted from the input video
using video analysis techniques and then fed into a machine-learning-based
music generation tool to compose a piece of music from scratch. The resulting
soundtrack is tied to video features, such as scene
transition markers and scene-level energy values, and is unique to the input
video. Although the system we describe here is only a preliminary
proof-of-concept, user evaluations of the output of the system have been
positive.
Comment: To be presented at the 16th ACM SIGGRAPH European Conference on Visual
Media Production, London, England, 17th-18th December 2019. 10 pages, 9 figures.
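The abstract does not spell out the video-analysis step; as one hedged illustration of the "cut transition" features it mentions (assuming Python with OpenCV; the function name and threshold are invented for this sketch), hard cuts can be located from frame-to-frame differences and later mapped to section boundaries in the generated music:

```python
import cv2
import numpy as np

def detect_cuts(video_path: str, threshold: float = 30.0) -> list[float]:
    """Flag a cut whenever the mean absolute grey-level difference between
    consecutive frames exceeds a threshold; returns cut times in seconds."""
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 25.0       # fall back if FPS is unknown
    cuts, prev, frame_idx = [], None, 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        grey = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY).astype(np.float32)
        if prev is not None and np.mean(np.abs(grey - prev)) > threshold:
            cuts.append(frame_idx / fps)
        prev, frame_idx = grey, frame_idx + 1
    cap.release()
    return cuts
```

A real system would also need shot-level energy estimates and a mapping from these timestamps to musical structure, which this sketch deliberately leaves out.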
MuseGAN: Multi-track Sequential Generative Adversarial Networks for Symbolic Music Generation and Accompaniment
Generating music has a few notable differences from generating images and
videos. First, music is an art of time, necessitating a temporal model. Second,
music is usually composed of multiple instruments/tracks with their own
temporal dynamics, but collectively they unfold over time interdependently.
Lastly, musical notes are often grouped into chords, arpeggios or melodies in
polyphonic music, so imposing a strict chronological ordering of notes is not
naturally suitable. In this paper, we propose three models for symbolic
multi-track music generation under the framework of generative adversarial
networks (GANs). The three models, which differ in the underlying assumptions
and accordingly the network architectures, are referred to as the jamming
model, the composer model and the hybrid model. We trained the proposed models
on a dataset of over one hundred thousand bars of rock music and applied them
to generate piano-rolls of five tracks: bass, drums, guitar, piano and strings.
A few intra-track and inter-track objective metrics are also proposed to
evaluate the generative results, in addition to a subjective user study. We
show that our models can generate coherent music of four bars right from
scratch (i.e. without human inputs). We also extend our models to human-AI
cooperative music generation: given a specific track composed by a human, we can
generate four additional tracks to accompany it. All code, the dataset and the
rendered audio samples are available at https://salu133445.github.io/musegan/.
Comment: To appear at AAAI 2018.
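As a rough, hedged illustration of how the jamming and composer assumptions differ (assuming PyTorch; all sizes and class names are invented for this sketch, and the real MuseGAN generators are convolutional): the jamming-style setup draws an independent latent code per track, while the composer-style setup generates all tracks jointly from one shared code.

```python
import torch
import torch.nn as nn

# Illustrative sizes only: 5 tracks, 4 bars of piano-roll per sample.
N_TRACKS, Z_DIM, BARS, STEPS, PITCHES = 5, 128, 4, 96, 84

class TrackGenerator(nn.Module):
    """Toy generator mapping one latent vector to one track's piano-roll."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(Z_DIM, 256), nn.ReLU(),
            nn.Linear(256, BARS * STEPS * PITCHES), nn.Sigmoid())

    def forward(self, z):
        return self.net(z).view(-1, BARS, STEPS, PITCHES)

class JointGenerator(nn.Module):
    """Toy generator emitting all tracks at once from a single shared code."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(Z_DIM, 512), nn.ReLU(),
            nn.Linear(512, N_TRACKS * BARS * STEPS * PITCHES), nn.Sigmoid())

    def forward(self, z):
        return self.net(z).view(-1, N_TRACKS, BARS, STEPS, PITCHES)

# Jamming-style: independent generators and latent codes per track.
jam_rolls = torch.stack(
    [TrackGenerator()(torch.randn(1, Z_DIM)) for _ in range(N_TRACKS)], dim=1)

# Composer-style: one generator driven by one shared latent code.
comp_rolls = JointGenerator()(torch.randn(1, Z_DIM))
```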
Look, Listen and Learn
We consider the question: what can be learnt by looking at and listening to a
large number of unlabelled videos? There is a valuable, but so far untapped,
source of information contained in the video itself: the correspondence
between the visual and the audio streams. We introduce a novel
"Audio-Visual Correspondence" learning task that makes use of this. Training
visual and audio networks from scratch, without any additional supervision
other than the raw unconstrained videos themselves, is shown to successfully
solve this task, and, more interestingly, result in good visual and audio
representations. These features set the new state-of-the-art on two sound
classification benchmarks, and perform on par with the state-of-the-art
self-supervised approaches on ImageNet classification. We also demonstrate that
the network is able to localize objects in both modalities, as well as perform
fine-grained recognition tasks.
Comment: Appears in: IEEE International Conference on Computer Vision (ICCV) 2017.
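As a minimal sketch of the audio-visual correspondence setup described above (assuming PyTorch; the layer sizes are placeholders, not the paper's architecture), two subnetworks embed an image frame and an audio spectrogram, and a small head classifies whether the pair comes from the same video:

```python
import torch
import torch.nn as nn

class AVCNet(nn.Module):
    """Toy audio-visual correspondence network: positives pair a frame with
    audio from the same clip, negatives pair it with audio from another clip,
    so no labels beyond the raw videos are required."""
    def __init__(self, emb_dim: int = 128):
        super().__init__()
        self.vision = nn.Sequential(               # stand-in image branch
            nn.Conv2d(3, 16, 3, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, emb_dim))
        self.audio = nn.Sequential(                # stand-in spectrogram branch
            nn.Conv2d(1, 16, 3, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, emb_dim))
        self.head = nn.Sequential(                 # corresponds? yes / no
            nn.Linear(2 * emb_dim, 64), nn.ReLU(), nn.Linear(64, 2))

    def forward(self, frame, spectrogram):
        fused = torch.cat([self.vision(frame), self.audio(spectrogram)], dim=1)
        return self.head(fused)
```

After training on this binary task, either branch can be reused as a feature extractor, which is how the sound-classification and ImageNet evaluations mentioned above are carried out.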
App creation in schools for different curricula subjects - lesson learned
The next generation of jobs will be characterized by an increased demand for
people with computational and problem-solving skills. In Austria, computer
science topics are underrepresented in school curricula; hence, teaching time
for these topics is limited. From primary through secondary school, only a few
opportunities exist for young students to explore programming. Furthermore,
today's teachers are rarely trained in computer science, which impairs their
potential to motivate students in these courses. Within the "No One Left
Behind" (NOLB) project, teachers were supported to guide and assist their
students in their learning processes by constructing ideas through game making.
Thus, students created games that referred to different subject areas by using
the programming tool Pocket Code, an app developed at Graz University of
Technology (TU-Graz). This tool helps students to take control of their own
education, becoming more engaged, interested, and empowered as a result. To
ensure an optimal integration of the app in diverse subjects, the different
backgrounds (technical and non-technical) of teachers must be considered as
well. First, teachers were supported to use Pocket Code in the different
subjects in school within the feasibility study of the project. Observed
challenges and difficulties using the app have been gathered. Second, we
conducted interviews with teachers and students to underpin our onsite
observations. As a result, it was possible to validate Pocket Code's potential
to be used in a diverse range of subjects. Third, we focused especially on
those teachers who were not technically trained to provide them with a
framework for Pocket Code units, e.g., with the help of structured lesson plans
and predefined templates.
Comment: 10 pages, 5 tables. EduLearn 201
End-to-End Cross-Modality Retrieval with CCA Projections and Pairwise Ranking Loss
Cross-modality retrieval encompasses retrieval tasks where the fetched items
are of a different type than the search query, e.g., retrieving pictures
relevant to a given text query. The state-of-the-art approach to cross-modality
retrieval relies on learning a joint embedding space of the two modalities,
where items from either modality are retrieved using nearest-neighbor search.
In this work, we introduce a neural network layer based on Canonical
Correlation Analysis (CCA) that learns better embedding spaces by analytically
computing projections that maximize correlation. In contrast to previous
approaches, the CCA Layer (CCAL) allows us to combine existing objectives for
embedding space learning, such as pairwise ranking losses, with the optimal
projections of CCA. We show the effectiveness of our approach for
cross-modality retrieval on three different scenarios (text-to-image,
audio-sheet-music and zero-shot retrieval), surpassing both Deep CCA and a
multi-view network using freely learned projections optimized by a pairwise
ranking loss, especially when little training data is available (the code for
all three methods is released at: https://github.com/CPJKU/cca_layer).
Comment: Preliminary version of a paper published in the International Journal
of Multimedia Information Retrieval.
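As a rough reference for the analytic step such a layer builds on (this is the textbook CCA computation, assuming PyTorch; it is not the authors' released implementation at the linked repository), the projections that maximize correlation between two embedding batches can be obtained in closed form:

```python
import torch

def cca_projections(x: torch.Tensor, y: torch.Tensor, eps: float = 1e-4):
    """Closed-form CCA: returns projection matrices A and B mapping
    x (n x d1) and y (n x d2) into a shared space where paired
    dimensions are maximally correlated."""
    x = x - x.mean(dim=0)
    y = y - y.mean(dim=0)
    n = x.shape[0]
    sxx = (x.T @ x) / (n - 1) + eps * torch.eye(x.shape[1])  # regularized covariances
    syy = (y.T @ y) / (n - 1) + eps * torch.eye(y.shape[1])
    sxy = (x.T @ y) / (n - 1)

    def inv_sqrt(m):                      # inverse matrix square root via eigh
        w, v = torch.linalg.eigh(m)
        return v @ torch.diag(w.clamp_min(eps).rsqrt()) @ v.T

    t = inv_sqrt(sxx) @ sxy @ inv_sqrt(syy)
    u, _, vh = torch.linalg.svd(t)
    # Only the first min(d1, d2) columns are meaningful canonical directions.
    return inv_sqrt(sxx) @ u, inv_sqrt(syy) @ vh.T
```

In the paper's setting the projected embeddings are additionally trained with a pairwise ranking loss; this sketch covers only the analytic projection step.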