Signal processing methods for beat tracking, music segmentation, and audio retrieval
The goal of music information retrieval (MIR) is to develop novel strategies and techniques for organizing, exploring, accessing, and understanding music data in an efficient manner. The conversion of waveform-based audio data into semantically meaningful feature representations by the use of digital signal processing techniques is at the center of MIR and constitutes a difficult field of research because of the complexity and diversity of music signals. In this thesis, we introduce novel signal processing methods that allow for extracting musically meaningful information from audio signals. As our main strategy, we exploit musical knowledge about the signals' properties to derive feature representations that show a significant degree of robustness against musical variations but still exhibit a high musical expressiveness. We apply this general strategy to three different areas of MIR. Firstly, we introduce novel techniques for extracting tempo and beat information, where we particularly consider challenging music with changing tempo and soft note onsets. Secondly, we present novel algorithms for the automated segmentation and analysis of folk song field recordings, where one has to cope with significant fluctuations in intonation and tempo as well as recording artifacts. Thirdly, we explore a cross-version approach to content-based music retrieval based on the query-by-example paradigm. In all three areas, we focus on application scenarios where strong musical variations make the extraction of musically meaningful information a challenging task.
Real-time Sound Source Separation For Music Applications
Sound source separation refers to the task of extracting individual sound sources from some number of mixtures of those sound sources. In this thesis, a novel sound source separation algorithm for musical applications is presented. It leverages the fact that the vast majority of commercially recorded music since the 1950s has been mixed down for two-channel reproduction, more commonly known as stereo. The algorithm, presented in Chapter 3 of this thesis, requires no prior knowledge or learning and performs separation based purely on azimuth discrimination within the stereo field. It exploits the use of the pan pot as a means of achieving image localisation within stereophonic recordings: as a result, only an interaural intensity difference exists between the left and right channels for a single source. We use gain scaling and phase cancellation techniques to expose frequency-dependent nulls across the azimuth domain, from which source separation and resynthesis are carried out. The algorithm is demonstrated to be state of the art in the field of sound source separation and also to be a useful pre-process for other tasks such as music segmentation and surround sound upmixing.
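The gain-scaling and phase-cancellation idea can be illustrated with a minimal sketch for a single frequency bin. It assumes one source per bin, panned by intensity only so that the left channel equals a gain times the right channel; the function name `azimuth_plane`, the gain grid, and the example values are hypothetical and not taken from the thesis.

```python
import numpy as np

def azimuth_plane(left_bin, right_bin, num_gains=101):
    """Scale the right channel by each candidate gain and subtract it from the left.

    Returns the gain grid and the magnitude of the cancelled signal; a null
    (minimum) appears at the gain that matches the source's pan position.
    """
    gains = np.linspace(0.0, 1.0, num_gains)
    return gains, np.abs(left_bin - gains * right_bin)

# One complex spectral value of a source, panned toward the right channel
# with an intensity ratio of 0.4 (assumed value for illustration).
source = 0.8 * np.exp(1j * 0.3)
left, right = 0.4 * source, source

gains, plane = azimuth_plane(left, right)
null_gain = gains[np.argmin(plane)]  # gain at which the source cancels
```

Because the two channels differ only in intensity, the cancelled magnitude is proportional to the distance between the candidate gain and the true pan gain, so the location of the null indicates where the source sits in the stereo field.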
Spatial auditory display for acoustics and music collections
This PhD thesis explores how audio can be better incorporated into how people access
information and does so by developing approaches for creating three-dimensional audio
environments with low processing demands. This is done by investigating three research
questions.
Mobile applications have processor and memory requirements that restrict the
number of concurrent static or moving sound sources that can be rendered with binaural
audio. Is there a more efficient approach that is as perceptually accurate as the traditional
method? This thesis concludes that virtual Ambisonics is an efficient and accurate means
to render a binaural auditory display consisting of noise signals placed on the horizontal
plane without head tracking. Virtual Ambisonics is then more efficient than convolution
of HRTFs if more than two sound sources are concurrently rendered or if movement of
the sources or head tracking is implemented.
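As a rough illustration of why the cost of an Ambisonic display need not grow with the number of sources, the sketch below encodes sources on the horizontal plane into first-order B-format (W, X, Y) using the traditional encoding equations. The function names and the restriction to first order are assumptions made for the example; the thesis's decoding and HRTF stages are not shown.

```python
import numpy as np

def encode_bformat_horizontal(signal, azimuth_rad):
    """Encode a mono signal at a horizontal azimuth into W, X, Y channels."""
    signal = np.asarray(signal, dtype=float)
    w = signal / np.sqrt(2.0)          # omnidirectional component
    x = signal * np.cos(azimuth_rad)   # front-back component
    y = signal * np.sin(azimuth_rad)   # left-right component
    return np.stack([w, x, y])

def mix_sources(signals, azimuths_rad):
    """Sum the encoded sources; the result always has three channels."""
    return sum(encode_bformat_horizontal(s, a)
               for s, a in zip(signals, azimuths_rad))

rng = np.random.default_rng(0)
sources = [rng.standard_normal(8) for _ in range(5)]
mix = mix_sources(sources, np.linspace(0.0, np.pi, 5))
```

However many sources are mixed, the encoded scene keeps a fixed channel count, so any later filtering is applied to the channels and not to each source separately.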
Complex acoustics models require significant amounts of memory and processing. If
the memory and processor loads for a model are too large for a particular device, that
model cannot be interactive in real-time. What steps can be taken to allow a complex
room model to be interactive by using less memory and decreasing the computational
load? This thesis presents a new reverberation model based on hybrid reverberation
which uses a collection of B-format IRs. A new metric for determining the mixing
time of a room is developed and interpolation between early reflections is investigated.
Though hybrid reverberation typically uses a recursive filter such as an FDN for the late
reverberation, an average late reverberation tail is instead synthesised for convolution
reverberation.
Commercial interfaces for music search and discovery use little aural information
even though the information being sought is audio. How can audio be used in
interfaces for music search and discovery? This thesis looks at 20 interfaces and
determines that several themes emerge from past interfaces. These include using a two
or three-dimensional space to explore a music collection, allowing concurrent playback of
multiple sources, and tools such as auras to control how much information is presented. A
new interface, the amblr, is developed because virtual two-dimensional spaces populated
by music have been a common approach, but not yet a perfected one. The amblr is also
interpreted as an art installation which was visited by approximately 1000 people over 5
days. The installation maps the virtual space created by the amblr to a physical space.
Sparse Modeling of Grouped Line Spectra
This licentiate thesis focuses on clustered parametric models for estimation of line spectra, when the spectral content of a signal source is assumed to exhibit some form of grouping. Unlike previous parametric approaches, which generally require explicit knowledge of the model orders, this thesis exploits sparse modeling, where the orders are implicitly chosen. For line spectra, the non-linear parametric model is approximated by a linear system, containing an overcomplete basis of candidate frequencies, called a dictionary, and a large set of linear response variables that select and weight the components in the dictionary. Frequency estimates are obtained by solving a convex optimization program, where the sum of squared residuals is minimized. To discourage overfitting and to infer certain structure in the solution, different convex penalty functions are introduced into the optimization. The cost trade-off between fit and penalty is set by some user parameters, so as to approximate the true number of spectral lines in the signal, which implies that the response variable will be sparse, i.e., have few non-zero elements. Thus, instead of explicit model orders, the orders are implicitly set by this trade-off. For grouped variables, the dictionary is customized, and appropriate convex penalties selected, so that the solution becomes group sparse, i.e., has few groups with non-zero variables. In an array of sensors, the specific time-delays and attenuations will depend on the source and sensor positions. By modeling this, one may estimate the location of a source. In this thesis, a novel joint location and grouped frequency estimator is proposed, which exploits sparse modeling for both spectral and spatial estimates, showing robustness against sources with overlapping frequency content. For audio signals, this thesis uses two different features for clustering. 
Pitch is a perceptual property of sound that may be described by the harmonic model, i.e., by a group of spectral lines at integer multiples of a fundamental frequency, which we estimate by exploiting a novel adaptive total variation penalty. The other feature, chroma, is a concept in music theory that collects pitches whose frequencies differ by a power of 2 (i.e., by whole octaves) into groups. Using a chroma dictionary, together with appropriate group sparse penalties, we propose an automatic transcription of the chroma content of a signal.
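The trade-off between fit and penalty described in this abstract can be written, in generic notation, as a penalized least-squares problem. The symbols are illustrative assumptions and not taken from the thesis: y is the observed signal, A the dictionary of candidate frequencies, x the response variable, x_g the sub-vector of x belonging to group g, and λ the user parameter. The ℓ1 and grouped ℓ2 penalties shown are the standard lasso and group-lasso choices; the penalties actually used in the thesis, such as the adaptive total variation penalty, may differ.

```latex
% Sparse estimate: few non-zero elements in x
\hat{x} = \arg\min_{x} \; \tfrac{1}{2}\,\lVert y - A x \rVert_2^2 + \lambda \lVert x \rVert_1

% Group-sparse estimate: few groups x_g with non-zero elements
\hat{x} = \arg\min_{x} \; \tfrac{1}{2}\,\lVert y - A x \rVert_2^2 + \lambda \sum_{g=1}^{G} \lVert x_g \rVert_2
```

Increasing λ drives more elements (or whole groups) of the solution to zero, which is how the model order is set implicitly instead of being specified in advance.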
Machine Learning for Auditory Hierarchy
Coleman, W. (2021). Machine Learning for Auditory Hierarchy. This dissertation is submitted for the degree of Doctor of Philosophy, Technological University Dublin. Audio content is predominantly delivered in a stereo audio file of a static, pre-formed mix. The content creator makes volume, position and effects decisions, generally for presentation on stereo speakers, but ultimately has no control over how the content will be consumed. This leads to a poor listener experience when, for example, a feature film is mixed such that the dialogue is at a low level relative to the sound effects. Consumers can complain that they must turn the volume up to hear the words, but back down again because the effects levels are too loud. Addressing this problem requires a television mix optimised for the stereo speakers used in the vast majority of homes, which is not always available.
Flamenco music information retrieval.
Flamenco is a rich performance-oriented art music genre from Southern Spain, which attracts a growing community of aficionados around the globe. The constantly increasing number of digitally available flamenco recordings in music archives, video sharing platforms and online music services calls for the development of genre-specific description and analysis methods, capable of automatically indexing and examining these collections in a content-driven manner. Music Information Retrieval is a multi-disciplinary research area dedicated to the automatic extraction of musical information from audio recordings and scores. Most existing approaches were, however, developed in the context of popular or classical music and often do not generalise well to non-Western music traditions, in particular when the underlying music theoretical assumptions do not hold for these genres. The specific characteristics and concepts of a music tradition can furthermore imply new computational challenges, for which no suitable methods exist.
This thesis addresses these current shortcomings of Music Information Retrieval by tackling several computational challenges which arise in the context of flamenco music. To this end, a number of contributions to the field are made in the form of novel algorithms, comparative evaluations and data-driven studies, directed at various musical dimensions and encompassing several sub-areas of computer science, computational mathematics, statistics, optimisation and computational musicology. A particularity of flamenco, which immensely shapes the work presented in this thesis, is the absence of written scores. Consequently, computational approaches can rely solely on the direct analysis of raw audio recordings or automatically extracted transcriptions, and this restriction generates a set of new computational challenges. A key aspect of flamenco is the presence of recurring melodic templates, which are subject to heavy variation during performance. From a computational perspective, we identify three tasks related to this characteristic (melody classification, melody retrieval and melodic template extraction), which are addressed in this thesis. We furthermore approach the task of detecting repeated sung phrases in an unsupervised manner and explore the use of deep learning methods for image-based singer identification in flamenco videos and structural segmentation of flamenco recordings. Finally, we demonstrate in a data-driven corpus study how automatic annotations can be mined to discover interesting correlations and gain insights into a largely undocumented genre.
Análise de vídeo sensível (Sensitive-video analysis)
Advisors: Anderson de Rezende Rocha, Siome Klein Goldenstein. Doctoral thesis, Universidade Estadual de Campinas, Instituto de Computação.
Abstract: Sensitive video can be defined as any motion picture that may pose threats to its audience. Typical representatives include, but are not limited to, pornography, violence, child abuse, and cruelty to animals. Nowadays, with the ever more pervasive role of digital data in our lives, sensitive-content analysis represents a major concern to law enforcers, companies, tutors, and parents, due to the potential harm such content can cause to minors, students, workers, etc. At the same time, employing human mediators to constantly analyze huge troves of sensitive data often leads to stress and trauma, which justifies the search for computer-aided analysis. In this work, we tackle this problem in two ways. In the first, we aim at deciding whether or not a video stream presents sensitive content, which we refer to as sensitive-video classification. In the second, we aim at finding the exact moments at which a stream starts and stops displaying sensitive content, at frame level, which we refer to as sensitive-content localization. For both cases, we aim at designing and developing effective and efficient methods, with a low memory footprint and suitable for deployment on mobile devices. In this vein, we provide four major contributions. The first is a novel Bag-of-Visual-Words-based pipeline for efficient time-aware sensitive-video classification. The second is a novel high-level multimodal fusion pipeline for sensitive-content localization. The third, in turn, is a novel spatio-temporal video interest point detector and video content descriptor. Finally, the fourth contribution comprises a frame-level annotated 140-hour pornographic video dataset, which is the first in the literature that is appropriate for pornography localization. An important aspect of the first three contributions is their generality, in the sense that they can be employed, without modifications to their steps, for the detection of diverse sensitive content types, such as those mentioned previously. For validation, we choose pornography and violence, two of the most common types of inappropriate material, as target representatives of sensitive content.
We therefore perform classification and localization experiments, and report results for both types of content. The proposed solutions present an accuracy of 93% in pornography classification, and allow the correct localization of 91% of pornographic content within a video stream. The results for violence are also compelling: with the proposed approaches, we reached second place in an international competition on violent scene detection. Putting both in perspective, we learned that pornography detection is easier than its violence counterpart, opening several opportunities for additional investigations by the research community. The main reason for this difference is related to the distinct levels of subjectivity that are inherent to each concept. While pornography is usually more explicit, violence presents a broader spectrum of possible manifestations.
Degree: Doctorate (Doutor em Ciência da Computação), Computer Science. 1572763, 1197473 CAPE