138,750 research outputs found

    Identifying Cover Songs Using Information-Theoretic Measures of Similarity

    This work is licensed under a Creative Commons Attribution 3.0 License. For more information, see http://creativecommons.org/licenses/by/3.0/ This paper investigates methods for quantifying similarity between audio signals, specifically for the task of cover song detection. We consider an information-theoretic approach, where we compute pairwise measures of predictability between time series. We compare discrete-valued approaches operating on quantized audio features to continuous-valued approaches. In the discrete case, we propose a method for computing the normalized compression distance, where we account for correlation between time series. In the continuous case, we propose to compute information-based measures of similarity as statistics of the prediction error between time series. We evaluate our methods on two cover song identification tasks, using a data set comprising 300 jazz standards and using the Million Song Dataset. For both datasets, we observe that continuous-valued approaches outperform discrete-valued approaches. We consider approaches to estimating the normalized compression distance (NCD) based on string compression and prediction, where we observe that our proposed normalized compression distance with alignment (NCDA) improves average performance over NCD for sequential compression algorithms. Finally, we demonstrate that continuous-valued distances may be combined to improve performance with respect to baseline approaches. Using a large-scale filter-and-refine approach, we demonstrate state-of-the-art performance for cover song identification using the Million Song Dataset. The work of P. Foster was supported by an Engineering and Physical Sciences Research Council Doctoral Training Account studentship.
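
The normalized compression distance mentioned in this abstract has a standard definition, NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)), where C is the compressed length. The sketch below is a minimal illustration of that plain NCD only, not the paper's NCDA variant or its feature pipeline. The use of zlib as the compressor and the toy byte strings standing in for quantized audio features are assumptions made for illustration.

```python
import zlib


def compressed_size(data: bytes) -> int:
    """Length in bytes of the zlib-compressed input (zlib is an assumed compressor)."""
    return len(zlib.compress(data, 9))


def ncd(x: bytes, y: bytes) -> float:
    """Plain normalized compression distance between two byte strings."""
    cx, cy = compressed_size(x), compressed_size(y)
    cxy = compressed_size(x + y)  # compress the concatenation of both sequences
    return (cxy - min(cx, cy)) / max(cx, cy)


# Hypothetical stand-ins for quantized feature sequences.
melody = b"C E G E " * 200
cover = b"C E G E " * 150 + b"C E A E " * 50
# Deterministic bytes with little structure in common with the melody.
unrelated = bytes((i * 97 + 13) % 251 for i in range(1600))

print("melody vs cover:    ", ncd(melody, cover))
print("melody vs unrelated:", ncd(melody, unrelated))
```

A smaller value indicates that one sequence helps compress the other, so the cover-like sequence should score lower against the melody than the unrelated one does.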

    Compression Methods for Structured Floating-Point Data and their Application in Climate Research

    The use of new technologies, such as GPU boosters, has led to a dramatic increase in the computing power of High-Performance Computing (HPC) centres. This development, coupled with new climate models that can better utilise this computing power thanks to software development and internal design, has moved the bottleneck from solving the differential equations describing Earth’s atmospheric interactions to actually storing the variables. The current approach to solving the storage problem is inadequate: either the number of variables to be stored is limited or the temporal resolution of the output is reduced. If it is subsequently determined that another variable is required which has not been saved, the simulation must be run again. This thesis deals with the development of novel compression algorithms for structured floating-point data such as climate data so that they can be stored in full resolution. Compression is performed by decorrelation and subsequent coding of the data. The decorrelation step eliminates redundant information in the data. During coding, the actual compression takes place and the data is written to disk. A lossy compression algorithm additionally has an approximation step to unify the data for better coding. The approximation step reduces the complexity of the data for the subsequent coding, e.g. by using quantization. This work makes a new scientific contribution to each of the three steps described above. This thesis presents a novel lossy compression method for time-series data using an Auto Regressive Integrated Moving Average (ARIMA) model to decorrelate the data. In addition, the concept of information spaces and contexts is presented to use information across dimensions for decorrelation. Furthermore, a new coding scheme is described which reduces the weaknesses of the eXclusive-OR (XOR) difference calculation and achieves a better compression factor than current lossless compression methods for floating-point numbers. Finally, a modular framework is introduced that allows the creation of user-defined compression algorithms. The experiments presented in this thesis show that it is possible to increase the information content of lossily compressed time-series data by applying an adaptive compression technique which preserves selected data with higher precision. An analysis of lossless compression of these time series showed no success. However, the lossy ARIMA compression model proposed here is able to capture all relevant information. The reconstructed data can reproduce the time series to such an extent that statistically relevant information for the description of climate dynamics is preserved. Experiments indicate that there is a significant dependence of the compression factor on the selected traversal sequence and the underlying data model. The influence of these structural dependencies on prediction-based compression methods is investigated in this thesis. For this purpose, the concept of Information Spaces (IS) is introduced. IS contributes to improving the predictions of the individual predictors by nearly 10% on average. Perhaps more importantly, the standard deviation of compression results is on average 20% lower. Using IS provides better predictions and consistent compression results. Furthermore, it is shown that shifting the prediction and true value leads to a better compression factor with minimal additional computational costs. This allows the use of more resource-efficient prediction algorithms to achieve the same or better compression factor or higher throughput during compression or decompression. The coding scheme proposed here achieves a better compression factor than current state-of-the-art methods. Finally, this thesis presents a modular framework for the development of compression algorithms. The framework supports the creation of user-defined predictors and offers functionalities such as the execution of benchmarks, the random subdivision of n-dimensional data, the quality evaluation of predictors, the creation of ensemble predictors and the execution of validity tests for sequential and parallel compression algorithms. This research was initiated because of the needs of climate science, but the application of its contributions is not limited to it. The results of this thesis are of major benefit for developing and improving any compression algorithm for structured floating-point data.
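
The abstract above refers to prediction-based compression and to the XOR difference between a predicted and a true floating-point value. The sketch below illustrates only that baseline idea, under stated assumptions: it uses a trivial "previous value" predictor and hypothetical function names, and it is not the thesis's improved coding scheme, its ARIMA model, or its Information Spaces. When the prediction is close to the true value, the XOR of their bit patterns tends to have many leading zero bits, which a later coding step could store compactly.

```python
import struct


def float_to_bits(value: float) -> int:
    """Reinterpret a 64-bit float as an unsigned 64-bit integer."""
    return struct.unpack("<Q", struct.pack("<d", value))[0]


def bits_to_float(bits: int) -> float:
    """Reinterpret an unsigned 64-bit integer as a 64-bit float."""
    return struct.unpack("<d", struct.pack("<Q", bits))[0]


def xor_residuals(values):
    """XOR each value's bits with the bits of its prediction.

    Assumed predictor: the previous value (0.0 for the first element).
    """
    residuals = []
    prediction = 0.0
    for value in values:
        residuals.append(float_to_bits(value) ^ float_to_bits(prediction))
        prediction = value
    return residuals


def reconstruct(residuals):
    """Invert xor_residuals by replaying the same predictor."""
    values = []
    prediction = 0.0
    for residual in residuals:
        value = bits_to_float(residual ^ float_to_bits(prediction))
        values.append(value)
        prediction = value
    return values


def leading_zero_bits(residual: int) -> int:
    """Number of leading zero bits in a 64-bit residual."""
    return 64 - residual.bit_length()


# Hypothetical temperature samples in kelvin.
series = [280.15, 280.15, 280.20, 281.0]
residuals = xor_residuals(series)
print([leading_zero_bits(r) for r in residuals])
```

Because XOR is its own inverse, the round trip through `reconstruct` is bit-exact, which is why this kind of residual suits lossless coding.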

    IDENTIFICATION OF COVER SONGS USING INFORMATION THEORETIC MEASURES OF SIMILARITY

    13 pages, 5 figures, 4 tables. v3: Accepted version

    Compression and Conditional Emulation of Climate Model Output

    Numerical climate model simulations run at high spatial and temporal resolutions generate massive quantities of data. As our computing capabilities continue to increase, storing all of the data is not sustainable, and thus it is important to develop methods for representing the full datasets by smaller compressed versions. We propose a statistical compression and decompression algorithm based on storing a set of summary statistics as well as a statistical model describing the conditional distribution of the full dataset given the summary statistics. The statistical model can be used to generate realizations representing the full dataset, along with characterizations of the uncertainties in the generated data. Thus, the methods are capable of both compression and conditional emulation of the climate models. Considerable attention is paid to accurately modeling the original dataset (one year of daily mean temperature data), particularly with regard to the inherent spatial nonstationarity in global fields, and to determining the statistics to be stored, so that the variation in the original data can be closely captured, while allowing for fast decompression and conditional emulation on modest computers.
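
The idea of storing summary statistics and then generating realizations conditional on them can be illustrated with a deliberately simple toy model. The sketch below is not the paper's method: it assumes a single time series, independent Gaussian residuals, and block means as the only stored statistics, and it ignores the spatial nonstationarity the abstract emphasizes. All function and field names are hypothetical.

```python
import random
import statistics


def compress(series, block):
    """Keep only per-block means and one pooled residual standard deviation."""
    means, residuals = [], []
    for start in range(0, len(series), block):
        chunk = series[start:start + block]
        mean = statistics.fmean(chunk)
        means.append(mean)
        residuals.extend(value - mean for value in chunk)
    sigma = statistics.pstdev(residuals)
    return {"block": block, "n": len(series), "means": means, "sigma": sigma}


def emulate(summary, rng):
    """Draw one realization whose block means equal the stored means.

    Under the assumed i.i.d. Gaussian model, conditioning on a block mean
    amounts to centring the simulated noise within the block and then
    adding the stored mean.
    """
    block, n = summary["block"], summary["n"]
    realization = []
    for index, mean in enumerate(summary["means"]):
        size = min(block, n - index * block)
        noise = [rng.gauss(0.0, summary["sigma"]) for _ in range(size)]
        centre = statistics.fmean(noise)
        realization.extend(mean + z - centre for z in noise)
    return realization


# Hypothetical daily temperatures: 60 days with a slow upward trend.
data_rng = random.Random(42)
daily = [10.0 + 0.1 * day + data_rng.gauss(0.0, 1.5) for day in range(60)]

summary = compress(daily, block=30)
sample = emulate(summary, random.Random(0))
print(len(summary["means"]), "stored means for", len(daily), "values")
```

Drawing several realizations with different seeds gives a spread that characterizes the uncertainty left after compression, which is the sense in which decompression doubles as conditional emulation here.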