24 research outputs found
On the Application of Generic Summarization Algorithms to Music
Several generic summarization algorithms were developed in the past and
successfully applied in fields such as text and speech summarization. In this
paper, we review and apply these algorithms to music. To evaluate this
summarization's performance, we adopt an extrinsic approach: we compare a Fado
Genre Classifier's performance using truncated contiguous clips against the
summaries extracted with those algorithms on 2 different datasets. We show that
Maximal Marginal Relevance (MMR), LexRank and Latent Semantic Analysis (LSA)
all improve classification performance in both datasets used for testing.
Comment: 12 pages, 1 table; Submitted to IEEE Signal Processing Letters
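As an illustration of the selection rule behind MMR (a generic sketch, not the paper's implementation), each summary segment is chosen greedily to maximize relevance to the whole piece minus redundancy with segments already selected; the feature vectors, the query vector, and the trade-off parameter `lam` below are all hypothetical:

```python
import math

def cosine(a, b):
    """Cosine similarity between two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def mmr_select(segments, query, k, lam=0.7):
    """Greedy MMR: trade off relevance to `query` (e.g. the mean
    feature vector of the whole song) against redundancy with
    segments already placed in the summary."""
    selected = []
    candidates = list(range(len(segments)))
    while candidates and len(selected) < k:
        def score(i):
            relevance = cosine(segments[i], query)
            redundancy = max((cosine(segments[i], segments[j])
                              for j in selected), default=0.0)
            return lam * relevance - (1 - lam) * redundancy
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected
```

With `lam` close to 1 the selection favors pure relevance; lowering it trades relevance for diversity among the chosen segments.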
Using Generic Summarization to Improve Music Information Retrieval Tasks
In order to satisfy processing time constraints, many MIR tasks process only
a segment of the whole music signal. This practice may lead to decreasing
performance, since the most important information for the tasks may not be in
those processed segments. In this paper, we leverage generic summarization
algorithms, previously applied to text and speech summarization, to summarize
items in music datasets. These algorithms build summaries that are both
concise and diverse by selecting appropriate segments from the input signal,
which makes them good candidates for summarizing music as well. We evaluate the
summarization process on binary and multiclass music genre classification
tasks, by comparing the performance obtained using summarized datasets against
the performances obtained using continuous segments (which is the traditional
method used for addressing the previously mentioned time constraints) and full
songs of the same original dataset. We show that GRASSHOPPER, LexRank, LSA,
MMR, and a Support Sets-based Centrality model improve classification
performance when compared to selected 30-second baselines. We also show that
summarized datasets lead to classification performance whose difference from
using full songs is not statistically significant. Furthermore, we argue for
the advantages of sharing summarized datasets for future MIR research.
Comment: 24 pages, 10 tables; Submitted to IEEE/ACM Transactions on Audio,
Speech, and Language Processing
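LexRank, one of the algorithms evaluated above, scores segments by their centrality in a similarity graph, PageRank-style. The sketch below is a minimal pure-Python illustration over hypothetical feature vectors, with an assumed similarity threshold and damping factor, not the configuration used in the paper:

```python
import math

def cosine(a, b):
    """Cosine similarity between two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def lexrank(segments, threshold=0.1, damping=0.85, iters=50):
    """PageRank-style centrality over a thresholded similarity graph;
    the top-scoring segments would form the summary."""
    n = len(segments)
    # adjacency: an edge wherever similarity exceeds the threshold
    adj = [[1.0 if i != j and cosine(segments[i], segments[j]) > threshold
            else 0.0 for j in range(n)] for i in range(n)]
    degree = [sum(row) or 1.0 for row in adj]  # avoid division by zero
    scores = [1.0 / n] * n
    for _ in range(iters):  # power iteration
        scores = [(1 - damping) / n +
                  damping * sum(adj[j][i] / degree[j] * scores[j]
                                for j in range(n))
                  for i in range(n)]
    return scores
```

Segments with the highest centrality scores are then concatenated into the summary.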
Quantifying Joint Activities using Cross-Recurrence Block Representation
Humans, as social beings, are capable of employing various behavioral cues, such as gaze, speech, manual action, and body posture, in everyday communication. However, extracting fine-grained interaction patterns in social contexts presents methodological challenges. Cross-Recurrence Plot Quantification Analysis (CRQA) is an analysis method invented in theoretical physics and recently applied in cognitive science to study interpersonal coordination. In this paper, we extend this approach to analyzing joint activities in child-parent interaction. We define a new representation, the Cross-Recurrence Block, based on CRQA. With this representation, we are able to capture interpersonal dynamics from more than two behavioral streams in one Cross-Recurrence Plot and derive a suite of measures to quantify detailed characteristics of coordination. Using a dataset collected from a child-parent interaction study, we show that these quantitative measures of joint activities reveal developmental changes in coordinative behavioral patterns between children and parents.
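A cross-recurrence plot itself is simple to state: mark every pair of time points where the two behavioral streams are within a tolerance of each other. A minimal sketch for one-dimensional streams follows; the block representation and derived measures in the paper are more elaborate:

```python
def cross_recurrence_plot(x, y, eps):
    """Binary cross-recurrence matrix: entry (i, j) is 1 when the
    i-th state of stream x is within eps of the j-th state of stream y."""
    return [[1 if abs(a - b) < eps else 0 for b in y] for a in x]

def recurrence_rate(rp):
    """Fraction of recurrent points -- one of the basic CRQA measures."""
    return sum(map(sum, rp)) / (len(rp) * len(rp[0]))
```

Diagonal structures in the plot indicate stretches where the two streams follow each other, which is what coordination measures quantify.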
Detection of speech signal in strong ship-radiated noise based on spectrum entropy
Comparing the frequency spectrum distributions calculated from several successive frames, the change in the frequency spectrum of speech frames between successive frames is larger than that of ship-radiated noise. The aim of this work is to propose a novel speech detection algorithm for strong ship-radiated noise, since inaccurate sentence boundaries are a major cause of errors in automatic speech recognition against a strong noise background. Based on that characteristic, a new feature, the repeating pattern of frequency spectrum trend (RPFST), is calculated from spectrum entropy. First, speech is detected roughly, with a precision of 1 s, by calculating the RPFST feature. Then, the detection precision is refined to 20 ms, the length of a frame, by frame shifting. Finally, benchmarked on a large measured dataset, a detection accuracy of 92 % is achieved. The experimental results show the feasibility of the algorithm for all kinds of speech and ship-radiated noise.
Inpainting of long audio segments with similarity graphs
We present a novel method for the compensation of long duration data loss in
audio signals, in particular music. The concealment of such signal defects is
based on a graph that encodes signal structure in terms of time-persistent
spectral similarity. A suitable candidate segment for the substitution of the
lost content is proposed by an intuitive optimization scheme and smoothly
inserted into the gap, i.e. the lost or distorted signal region. Extensive
listening tests show that the proposed algorithm provides highly promising
results when applied to a variety of real-world music signals.
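The paper's graph-based optimization is not reproduced here, but the underlying intuition, picking the candidate segment whose borders are most spectrally similar to the frames flanking the gap, can be sketched over hypothetical per-frame feature vectors:

```python
import math

def cosine(a, b):
    """Cosine similarity between two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def best_fill(frames, gap_start, gap_len):
    """Return indices of the segment whose surrounding frames best match
    the frames flanking the gap [gap_start, gap_start + gap_len)."""
    left, right = frames[gap_start - 1], frames[gap_start + gap_len]
    best, best_score = None, -math.inf
    for s in range(1, len(frames) - gap_len - 1):
        # skip candidates whose footprint touches the gap
        if s <= gap_start + gap_len and s + gap_len >= gap_start:
            continue
        score = (cosine(frames[s - 1], left) +
                 cosine(frames[s + gap_len], right))
        if score > best_score:
            best, best_score = s, score
    return list(range(best, best + gap_len))
```

In a real concealment system the chosen segment would then be crossfaded into the gap rather than spliced in hard.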
Music shapelets for fast cover song recognition
A cover song is a new performance or recording of a previously recorded song by an artist other than the original one. The automatic identification of cover songs is useful for a wide range of tasks, from fans looking for new versions of their favorite songs to organizations involved in licensing copyrighted songs. This is a difficult task given that a cover may differ from the original song in key, timbre, tempo, structure, arrangement, and even the language of the vocals. Cover song identification has attracted some attention recently. However, most of the state-of-the-art approaches are based on similarity search, which involves a large number of similarity computations to retrieve potential cover versions for a query recording. In this paper, we adapt the idea of time series shapelets for content-based music retrieval. Our proposal adds a training phase that finds small excerpts of feature vectors that best describe each song. We demonstrate that we can use such small segments to identify cover songs with higher identification rates and more than one order of magnitude faster than methods that use features to describe the whole music.
FAPESP (grants #2011/17698-5, #2013/26151-5, and #2015/07628-0); CNPq (grants 446330/2014-0 and 303083/2013-1)
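The shapelet idea hinges on one primitive: the distance from a recording's feature sequence to a short excerpt is the minimum distance over all sliding windows. A minimal sketch, univariate and Euclidean for simplicity (the paper operates on feature vectors):

```python
import math

def shapelet_distance(series, shapelet):
    """Distance from a series to a shapelet: the minimum Euclidean
    distance over all sliding windows of the shapelet's length."""
    m = len(shapelet)
    best = math.inf
    for start in range(len(series) - m + 1):
        d = math.sqrt(sum((series[start + i] - shapelet[i]) ** 2
                          for i in range(m)))
        best = min(best, d)
    return best
```

Once a discriminative shapelet has been learned per song, matching a query requires only these short comparisons instead of whole-song similarity search, which is where the order-of-magnitude speedup comes from.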
Fusion of Multimodal Information in Music Content Analysis
Music is often processed through its acoustic realization. This is restrictive in the sense that music is clearly a highly multimodal concept, where various types of heterogeneous information can be associated with a given piece of music (a musical score, musicians' gestures, lyrics, user-generated metadata, etc.). This has recently led researchers to apprehend music through its various facets, giving rise to "multimodal music analysis" studies. This article gives a synthetic overview of methods that have been successfully employed in multimodal signal analysis. In particular, their use in music content processing is discussed in more detail through five case studies that highlight different multimodal integration techniques. The case studies include an example of cross-modal correlation for music video analysis, an audiovisual drum transcription system, a description of the concept of informed source separation, a discussion of multimodal dance-scene analysis, and an example of user-interactive music analysis. In light of these case studies, some perspectives on multimodality in music processing are finally suggested.
Clustering by compression
We present a new method for clustering based on compression. The method
doesn't use subject-specific features or background knowledge, and works as
follows: First, we determine a universal similarity distance, the normalized
compression distance or NCD, computed from the lengths of compressed data files
(singly and in pairwise concatenation). Second, we apply a hierarchical
clustering method. The NCD is universal in that it is not restricted to a
specific application area, and works across application area boundaries. A
theoretical precursor, the normalized information distance, co-developed by one
of the authors, is provably optimal but uses the non-computable notion of
Kolmogorov complexity. We propose precise notions of similarity metric and
normal compressor, and show that the NCD based on a normal compressor is a
similarity metric that approximates universality. To extract a hierarchy of
clusters from
the distance matrix, we determine a dendrogram (binary tree) by a new quartet
method and a fast heuristic to implement it. The method is implemented and
available as public software, and is robust under choice of different
compressors. To substantiate our claims of universality and robustness, we
report evidence of successful application in areas as diverse as genomics,
virology, languages, literature, music, handwritten digits, astronomy, and
combinations of objects from completely different domains, using statistical,
dictionary, and block sorting compressors. In genomics we presented new
evidence for major questions in Mammalian evolution, based on
whole-mitochondrial genomic analysis: the Eutherian orders and the Marsupionta
hypothesis against the Theria hypothesis.Comment: LaTeX, 27 pages, 20 figure
Onde é que eu já ouvi isto? (Where have I heard this before?)
Master's thesis in Informatics Engineering, presented to the Universidade de Lisboa through the Faculdade de Ciências, 2012.
Over the last ten years, technological advances in audio compression and computer networks have prompted a huge increase in the availability and sharing of digital music.
The main purpose of this project is to develop a prototype with which the similarity between pieces of audio can be measured exclusively from the audio content itself, that is, from its most basic properties and characteristics.
This prototype will analyze the inherent characteristics of each piece of audio and use the data from this analysis to compare music, regardless of any metadata that may exist. The basis for this comparison is a fingerprint of the audio itself, which aims to generate a signature that identifies a piece of audio. This signature transforms the audio signal into a sequence of vectors, where each vector is a set of spectral features: Zero-Crossings, Spectral Centroid, Rolloff, Flux, and Mel-Frequency Cepstral Coefficients (MFCC) of the audio signal. More specifically, the audio signal is converted into a sequence of symbols that correspond to the characteristics of a piece of audio. This "fingerprint" not only identifies a piece of music, but also provides information about its musical characteristics.
Using this prototype, it is possible to select movies based on the similarity between pieces of audio: the user can be shown a series of films whose audio is similar to a piece of audio they choose, making it possible to search a database of video documents using only pieces of audio.
The work is part of one of the tasks of the VIRUS project (Video Information Retrieval Using Subtitles), funded by FCT, for which the techniques were largely already developed.
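Two of the spectral features listed above are simple enough to sketch directly. The implementations below are generic illustrations over a single frame of samples; a real system would apply windowing and use an FFT library:

```python
import math

def zero_crossing_rate(frame):
    """Fraction of consecutive sample pairs whose signs differ."""
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a < 0) != (b < 0))
    return crossings / (len(frame) - 1)

def spectral_centroid(frame):
    """Magnitude-weighted mean frequency bin (naive DFT)."""
    n = len(frame)
    mags = []
    for k in range(n // 2):
        re = sum(frame[t] * math.cos(2 * math.pi * k * t / n) for t in range(n))
        im = sum(frame[t] * math.sin(2 * math.pi * k * t / n) for t in range(n))
        mags.append(math.hypot(re, im))
    total = sum(mags)
    return (sum(k * m for k, m in enumerate(mags)) / total) if total else 0.0
```

Stacking such per-frame values (together with Rolloff, Flux, and MFCCs) yields the vector sequence that the fingerprint quantizes into symbols.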