83 research outputs found

    Bitwise Source Separation on Hashed Spectra: An Efficient Posterior Estimation Scheme Using Partial Rank Order Metrics

    Full text link
    This paper proposes an efficient bitwise solution to the single-channel source separation task. Most dictionary-based source separation algorithms rely on iterative update rules during the run time, which becomes computationally costly especially when we employ an overcomplete dictionary and sparse encoding that tend to give better separation results. To avoid such cost we propose a bitwise scheme on hashed spectra that leads to an efficient posterior probability calculation. For each source, the algorithm uses a partial rank order metric to extract robust features that form a binarized dictionary of hashed spectra. Then, for a mixture spectrum, its hash code is compared with each source's hashed dictionary in one pass. This simple voting-based dictionary search allows a fast and iteration-free estimation of ratio masking at each bin of a signal spectrogram. We verify that the proposed BitWise Source Separation (BWSS) algorithm produces sensible source separation results for the single-channel speech denoising task, with 6-8 dB mean SDR. To our knowledge, this is the first dictionary based algorithm for this task that is completely iteration-free in both training and testing

    Audio computing in the wild: frameworks for big data and small computers

    Get PDF
    This dissertation presents some machine learning algorithms that are designed to process as much data as needed while spending the least possible amount of resources, such as time, energy, and memory. Examples of those applications, but not limited to, can be a large-scale multimedia information retrieval system where both queries and the items in the database are noisy signals; collaborative audio enhancement from hundreds of user-created clips of a music concert; an event detection system running in a small device that has to process various sensor signals in real time; a lightweight custom chipset for speech enhancement on hand-held devices; instant music analysis engine running on smartphone apps. In all those applications, efficient machine learning algorithms are supposed to achieve not only a good performance, but also a great resource-efficiency. We start from some efficient dictionary-based single-channel source separation algorithms. We can train this kind of source-specific dictionaries by using some matrix factorization or topic modeling, whose elements form a representative set of spectra for the particular source. During the test time, the system estimates the contribution of the participating dictionary items for an unknown mixture spectrum. In this way we can estimate the activation of each source separately, and then recover the source of interest by using that particular source's reconstruction. There are some efficiency issues during this procedure. First off, searching for the optimal dictionary size is time consuming. Although for some very common types of sources, e.g. English speech, we know the optimal rank of the model by trial and error, it is hard to know in advance as to what is the optimal number of dictionary elements for the unknown sources, which are usually modeled during the test time in the semi-supervised separation scenarios. On top of that, when it comes to the non-stationary unknown sources, we had better maintain a dictionary that adapts its size and contents to the change of the source's nature. In this online semi-supervised separation scenario, a mechanism that can efficiently learn the optimal rank is helpful. To this end, a deflation method is proposed for modeling this unknown source with a nonnegative dictionary whose size is optimal. Since it has to be done during the test time, the deflation method that incrementally adds up new dictionary items shows better efficiency than a corresponding na\"ive approach where we simply try a bunch of different models. We have another efficiency issue when we are to use a large dictionary for better separation. It has been known that considering the manifold of the training data can help enhance the performance for the separation. This is because of the symptom that the usual manifold-ignorant convex combination models, such as from low-rank matrix decomposition or topic modeling, tend to result in ambiguous regions in the source-specific subspace defined by the dictionary items as the bases. For example, in those ambiguous regions, the original data samples cannot reside. Although some source separation techniques that respect data manifold could increase the performance, they call for more memory and computational resources due to the fact that the models call for larger dictionaries and involve sparse coding during the test time. This limitation led the development of hashing-based encoding of the audio spectra, so that some computationally heavy routines, such as nearest neighbor searches for sparse coding, can be performed in a cheaper bit-wise fashion. Matching audio signals can be challenging as well, especially if the signals are noisy and the matching task involves a big amount of signals. If it is an information retrieval application, for example, the bigger size of the data leads to a longer response time. On top of that, if the signals are defective, we have to perform the enhancement or separation job in the first place before matching, or we might need a matching mechanism that is robust to all those different kinds of artifacts. Likewise, the noisy nature of signals can add an additional complexity to the system. In this dissertation we will also see some compact integer (and eventually binary) representations for those matching systems. One of the possible compact representations would be a hashing-based matching method, where we can employ a particular kind of hash functions to preserve the similarity among original signals in the hash code domain. We will see that a variant of Winner Take All hashing can provide Hamming distance from noise-robust binary features, and that matching using the hash codes works well for some keyword spotting tasks. From the fact that some landmark hashes (e.g. local maxima from non-maximum suppression on the magnitudes of a mel-scaled spectrogram) can also robustly represent the time-frequency domain signal efficiently, a matrix decomposition algorithm is also proposed to take those irregular sparse matrices as input. Based on the assumption that the number of landmarks is a lot smaller than the number of all the time-frequency coefficients, we can think of this matching algorithm efficient if it operates entirely on the landmark representation. On the contrary to the usual landmark matching schemes, where matching is defined rigorously, we see the audio matching problem as soft matching where we find a similar constellation of landmarks to the query. In order to perform this soft matching job, the landmark positions are smoothed by a fixed-width Gaussian caps, with which the matching job is reduced down to calculating the amount of overlaps in-between those Gaussians. The Gaussian-based density approximation is also useful when we perform decomposition on this landmark representation, because otherwise the landmarks are usually too sparse to perform an ordinary matrix factorization algorithm, which are originally for a dense input matrix. We also expand this concept to the matrix deconvolution problem as well, where we see the input landmark representation of a source as a two-dimensional convolution between a source pattern and its corresponding sparse activations. If there are more than one source, as a noisy signal, we can think of this problem as factor deconvolution where the mixture is the combination of all the source-specific convolutions. The dissertation also covers Collaborative Audio Enhancement (CAE) algorithms that aim to recover the dominant source at a sound scene (e.g. music signals of a concert rather than the noise from the crowd) from multiple low-quality recordings (e.g. Youtube video clips uploaded by the audience). CAE can be seen as crowdsourcing a recording job, which needs a substantial amount of denoising effort afterward, because the user-created recordings might have been contaminated with various artifacts. In the sense that the recordings are from not-synchronized heterogenous sensors, we can also think of CAE as big ad-hoc sensor array processing. In CAE, each recording is assumed to be uniquely corrupted by a specific frequency response of the microphone, an aggressive audio coding algorithm, interference, band-pass filtering, clipping, etc. To consolidate all these recordings and come up with an enhanced audio, Probabilistic Latent Component Sharing (PLCS) has been proposed as a method of simultaneous probabilistic topic modeling on synchronized input signals. In PLCS, some of the parameters are fixed to be same during and after the learning process to capture common audio content, while the rest of the parameters are for the unwanted recording-specific interference and artifacts. We can speed up PLCS by incorporating a hashing-based nearest neighbor search so that at every EM iteration PLCS can be applied only to a small number of recordings that are closest to the current source estimation. Experiments on a small simulated CAE setup shows that the proposed PLCS can improve the sound quality from variously contaminated recordings. The nearest neighbor search technique during PLCS provides sensible speed-up at larger scaled experiments (up to 1000 recordings). Finally, to describe an extremely optimized deep learning deployment system, Bitwise Neural Networks (BNN) will be also discussed. In the proposed BNN, all the input, hidden, and output nodes are binaries (+1 and -1), and so are all the weights and bias. Consequently, the operations on them during the test time are defined with Boolean algebra, too. BNNs are spatially and computationally efficient in implementations, since (a) we represent a real-valued sample or parameter with a bit (b) the multiplication and addition correspond to bitwise XNOR and bit-counting, respectively. Therefore, BNNs can be used to implement a deep learning system in a resource-constrained environment, so that we can deploy a deep learning system on small devices without using up the power, memory, CPU clocks, etc. The training procedure for BNNs is based on a straightforward extension of backpropagation, which is characterized by the use of the quantization noise injection scheme, and the initialization strategy that learns a weight-compressed real-valued network only for the initialization purpose. Some preliminary results on the MNIST dataset and speech denoising demonstrate that a straightforward extension of backpropagation can successfully train BNNs whose performance is comparable while necessitating vastly fewer computational resources

    Hashing for Similarity Search: A Survey

    Full text link
    Similarity search (nearest neighbor search) is a problem of pursuing the data items whose distances to a query item are the smallest from a large database. Various methods have been developed to address this problem, and recently a lot of efforts have been devoted to approximate search. In this paper, we present a survey on one of the main solutions, hashing, which has been widely studied since the pioneering work locality sensitive hashing. We divide the hashing algorithms two main categories: locality sensitive hashing, which designs hash functions without exploring the data distribution and learning to hash, which learns hash functions according the data distribution, and review them from various aspects, including hash function design and distance measure and search scheme in the hash coding space

    Non-Negative Matrix Factorization Based Single Channel Source Separation

    Get PDF
    The significance of speech recognition systems is widespread, encompassing applications like speech translation, robotics, and security. However, these systems often encounter challenges arising from noise and source mixing during signal acquisition, leading to performance degradation. Addressing this, cutting-edge solutions must effectively incorporate temporal dependencies spanning longer periods than a single time frame. To tackle this issue, this study introduces a novel model employing non-negative matrix factorization (NMF) modelling. This technique harnesses the scattering transform, involving wavelet filters and pyramid scattering, to compute sources and mitigate undesired signals. Once signal estimation is achieved, a source separation algorithm is devised, employing an optimization process grounded in training and testing approaches. By quantifying performance metrics, a comparative analysis is conducted between existing methods and the proposed model. Results indicate the superior performance of the suggested approach, underscored by these metrics. This signifies that the NMF and scattering transform-based model adeptly addresses the challenge of effectively utilizing temporal dependencies spanning more than a single time frame, ultimately enhancing speech recognition system efficacy

    Improving k-nn search and subspace clustering based on local intrinsic dimensionality

    Get PDF
    In several novel applications such as multimedia and recommender systems, data is often represented as object feature vectors in high-dimensional spaces. The high-dimensional data is always a challenge for state-of-the-art algorithms, because of the so-called curse of dimensionality . As the dimensionality increases, the discriminative ability of similarity measures diminishes to the point where many data analysis algorithms, such as similarity search and clustering, that depend on them lose their effectiveness. One way to handle this challenge is by selecting the most important features, which is essential for providing compact object representations as well as improving the overall search and clustering performance. Having compact feature vectors can further reduce the storage space and the computational complexity of search and learning tasks. Support-Weighted Intrinsic Dimensionality (support-weighted ID) is a new promising feature selection criterion that estimates the contribution of each feature to the overall intrinsic dimensionality. Support-weighted ID identifies relevant features locally for each object, and penalizes those features that have locally lower discriminative power as well as higher density. In fact, support-weighted ID measures the ability of each feature to locally discriminate between objects in the dataset. Based on support-weighted ID, this dissertation introduces three main research contributions: First, this dissertation proposes NNWID-Descent, a similarity graph construction method that utilizes the support-weighted ID criterion to identify and retain relevant features locally for each object and enhance the overall graph quality. Second, with the aim to improve the accuracy and performance of cluster analysis, this dissertation introduces k-LIDoids, a subspace clustering algorithm that extends the utility of support-weighted ID within a clustering framework in order to gradually select the subset of informative and important features per cluster. k-LIDoids is able to construct clusters together with finding a low dimensional subspace for each cluster. Finally, using the compact object and cluster representations from NNWID-Descent and k-LIDoids, this dissertation defines LID-Fingerprint, a new binary fingerprinting and multi-level indexing framework for the high-dimensional data. LID-Fingerprint can be used for hiding the information as a way of preventing passive adversaries as well as providing an efficient and secure similarity search and retrieval for the data stored on the cloud. When compared to other state-of-the-art algorithms, the good practical performance provides an evidence for the effectiveness of the proposed algorithms for the data in high-dimensional spaces

    Service Abstractions for Scalable Deep Learning Inference at the Edge

    Get PDF
    Deep learning driven intelligent edge has already become a reality, where millions of mobile, wearable, and IoT devices analyze real-time data and transform those into actionable insights on-device. Typical approaches for optimizing deep learning inference mostly focus on accelerating the execution of individual inference tasks, without considering the contextual correlation unique to edge environments and the statistical nature of learning-based computation. Specifically, they treat inference workloads as individual black boxes and apply canonical system optimization techniques, developed over the last few decades, to handle them as yet another type of computation-intensive applications. As a result, deep learning inference on edge devices still face the ever increasing challenges of customization to edge device heterogeneity, fuzzy computation redundancy between inference tasks, and end-to-end deployment at scale. In this thesis, we propose the first framework that automates and scales the end-to-end process of deploying efficient deep learning inference from the cloud to heterogeneous edge devices. The framework consists of a series of service abstractions that handle DNN model tailoring, model indexing and query, and computation reuse for runtime inference respectively. Together, these services bridge the gap between deep learning training and inference, eliminate computation redundancy during inference execution, and further lower the barrier for deep learning algorithm and system co-optimization. To build efficient and scalable services, we take a unique algorithmic approach of harnessing the semantic correlation between the learning-based computation. Rather than viewing individual tasks as isolated black boxes, we optimize them collectively in a white box approach, proposing primitives to formulate the semantics of the deep learning workloads, algorithms to assess their hidden correlation (in terms of the input data, the neural network models, and the deployment trials) and merge common processing steps to minimize redundancy
    • …
    corecore