
    Steered mixture-of-experts for light field images and video : representation and coding

    Research in light field (LF) processing has increased substantially over the last decade. This is largely driven by the desire to achieve the same level of immersion and navigational freedom for camera-captured scenes as is currently available for CGI content. Standardization organizations such as MPEG and JPEG continue to follow conventional coding paradigms in which viewpoints are discretely represented on 2-D regular grids. These grids are then further decorrelated through hybrid DPCM/transform techniques. However, such 2-D regular grids are poorly suited for high-dimensional data such as LFs. We propose a novel coding framework for higher-dimensional image modalities, called Steered Mixture-of-Experts (SMoE). Coherent areas in the higher-dimensional space are represented by single higher-dimensional entities, called kernels. These kernels hold spatially localized information about light rays arriving at a certain region from any angle. The global model thus consists of a set of kernels that define a continuous approximation of the underlying plenoptic function. We introduce the theory of SMoE and illustrate its application to 2-D images, 4-D LF images, and 5-D LF video. We also propose an efficient coding strategy to convert the model parameters into a bitstream. Even without provisions for high-frequency information, the proposed method performs comparably to the state of the art for low-to-mid-range bitrates with respect to subjective visual quality of 4-D LF images. For 5-D LF video, we observe superior decorrelation and coding performance, with coding gains of a factor of 4 in bitrate for the same quality. At least equally important, our method inherently offers functionality for LF rendering that is lacking in other state-of-the-art techniques: (1) full zero-delay random access, (2) light-weight pixel-parallel view reconstruction, and (3) intrinsic view interpolation and super-resolution.
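    The kernel-based reconstruction described above can be illustrated with a minimal sketch. The snippet below is a toy 1-D reduction (the paper's model is 2-D to 5-D): each kernel pairs a Gaussian gate with an affine expert, and the reconstruction is their softly gated combination, a continuous function of position. All parameter values are invented for illustration.

```python
import numpy as np

# Toy 1-D SMoE sketch (assumption: a scalar reduction of the paper's
# N-dimensional model; all parameter values below are invented).
# Each kernel pairs a Gaussian gate N(x; mu_k, sigma_k^2) with an affine
# expert m_k(x) = a_k + b_k * (x - mu_k); the reconstruction is the
# gate-weighted sum of experts, a continuous function of position x.

def smoe_reconstruct(x, mu, sigma, a, b):
    # Gating weights: normalized Gaussian responsibilities per kernel.
    w = np.exp(-0.5 * ((x[:, None] - mu[None, :]) / sigma[None, :]) ** 2)
    w /= w.sum(axis=1, keepdims=True)
    # Each expert is an affine function centred on its kernel mean.
    experts = a[None, :] + b[None, :] * (x[:, None] - mu[None, :])
    return (w * experts).sum(axis=1)

x = np.linspace(0.0, 1.0, 101)
mu = np.array([0.25, 0.75])     # kernel centres (positions)
sigma = np.array([0.15, 0.15])  # kernel bandwidths
a = np.array([0.2, 0.8])        # expert offsets (local sample values)
b = np.array([0.0, 0.0])        # zero slopes: locally constant experts
y = smoe_reconstruct(x, mu, sigma, a, b)
```

    Because the gates are continuous in x, the same model can be sampled at any resolution, which is the property behind the intrinsic view interpolation and super-resolution noted above.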

    Comparing human and automatic speech recognition in a perceptual restoration experiment

    Speech that has been distorted by introducing spectral or temporal gaps is still perceived as continuous and complete by human listeners, so long as the gaps are filled with additive noise of sufficient intensity. When such perceptual restoration occurs, the speech is also more intelligible compared to the case in which noise has not been added in the gaps. This observation has motivated so-called 'missing data' systems for automatic speech recognition (ASR), but there have been few attempts to determine whether such systems are a good model of perceptual restoration in human listeners. Accordingly, the current paper evaluates missing data ASR in a perceptual restoration task. We evaluated two systems: one using a new approach to bounded marginalisation in the cepstral domain, and one using bounded conditional mean imputation. Both methods model the available speech information as a clean-speech posterior distribution that is subsequently passed to an ASR system. The proposed missing data ASR systems were evaluated using distorted speech in which spectro-temporal gaps were optionally filled with additive noise. Speech recognition performance of the proposed systems was compared against a baseline ASR system, and with human speech recognition performance on the same task. We conclude that missing data methods improve speech recognition performance in a manner that is consistent with perceptual restoration in human listeners.
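    As a rough illustration of the bounded-imputation idea above, the sketch below imputes a missing spectral value as the mean of a clean-speech prior truncated to the interval allowed by the noisy observation. It assumes a single Gaussian prior per feature rather than the paper's full posterior model, and all numbers are made up.

```python
import math

# Bounded conditional mean imputation sketch (assumption: one Gaussian
# prior per feature; the paper uses a richer clean-speech model).
# A masked spectral value is known only to lie in [lo, hi] (e.g. bounded
# above by the observed noisy energy); we impute the mean of the prior
# truncated to that interval.

def _phi(z):   # standard normal pdf
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def _Phi(z):   # standard normal cdf
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def bounded_imputation(mu, sigma, lo, hi):
    a, b = (lo - mu) / sigma, (hi - mu) / sigma
    mass = _Phi(b) - _Phi(a)            # prior mass inside the bounds
    # Mean of a truncated Gaussian (standard closed form).
    return mu + sigma * (_phi(a) - _phi(b)) / mass

# Prior says the clean log-energy is around 3.0; the noisy observation
# bounds the clean value to [0.0, 2.0], so the estimate is pulled down.
est = bounded_imputation(mu=3.0, sigma=1.0, lo=0.0, hi=2.0)
```

    The estimate always stays inside the observation bounds while leaning toward the prior mean, which is the compromise that bounded methods exploit.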

    Statistical framework for video decoding complexity modeling and prediction

    Video decoding complexity modeling and prediction is an increasingly important issue for efficient resource utilization in a variety of applications, including task scheduling, receiver-driven complexity shaping, and adaptive dynamic voltage scaling. In this paper, we present a novel view of this problem from a statistical framework perspective. We explore the statistical structure (clustering) of the execution time required by each video decoder module (entropy decoding, motion compensation, etc.) in conjunction with complexity features that are easily extractable at encoding time (representing the properties of each module's input source data). For this purpose, we employ Gaussian mixture models (GMMs) and an expectation-maximization algorithm to estimate the joint execution-time/feature probability density function (PDF). A training set of typical video sequences is used in an offline estimation process. The obtained GMM representation is then used in conjunction with the complexity features of new video sequences to predict the execution time required to decode them. Several prediction approaches are discussed and compared. The potential mismatch between the training set and new video content is addressed by adaptive online joint-PDF re-estimation. An experimental comparison is performed to evaluate the different approaches and to compare the proposed prediction scheme with related resource prediction schemes from the literature. The usefulness of the proposed complexity-prediction approaches is demonstrated in an application of rate-distortion-complexity optimized decoding.
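    The prediction step can be sketched as Gaussian conditioning on the joint PDF. The snippet below assumes a fixed, already-trained two-component GMM over a scalar feature f and execution time t (the paper fits the GMM with EM on real decoder traces and uses multiple features); the prediction E[t | f] mixes each component's linear regressor by its responsibility for the observed feature.

```python
import numpy as np

# GMM-based execution-time prediction sketch (assumption: hand-picked
# parameters of a 2-component joint GMM over (feature f, time t), in
# place of EM training on measured decoding times).

pi     = np.array([0.5, 0.5])    # component priors
mu_f   = np.array([1.0, 4.0])    # feature means per component
mu_t   = np.array([10.0, 40.0])  # execution-time means per component
var_f  = np.array([0.5, 0.5])    # feature variances
cov_ft = np.array([0.4, 0.4])    # feature-time covariances

def predict_time(f):
    # Responsibility of each component given the feature alone.
    lik = pi * np.exp(-0.5 * (f - mu_f) ** 2 / var_f) / np.sqrt(var_f)
    w = lik / lik.sum()
    # Per-component conditional mean of t given f (Gaussian conditioning).
    cond = mu_t + cov_ft / var_f * (f - mu_f)
    return float(w @ cond)
```

    Features near one cluster yield that cluster's regression estimate; features between clusters blend the two, which is what makes the joint-PDF view more expressive than a single global regression.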

    Co-Localization of Audio Sources in Images Using Binaural Features and Locally-Linear Regression

    This paper addresses the problem of localizing audio sources using binaural measurements. We propose a supervised formulation that simultaneously localizes multiple sources at different locations. The approach is intrinsically efficient because, contrary to prior work, it relies neither on source separation nor on monaural segregation. The method starts with a training stage that establishes a locally-linear Gaussian regression model between the directional coordinates of all the sources and the auditory features extracted from binaural measurements. While fixed-length wide-spectrum sounds (white noise) are used for training to reliably estimate the model parameters, we show that testing (localization) can be extended to variable-length sparse-spectrum sounds (such as speech), thus enabling a wide range of realistic applications. Indeed, we demonstrate that the method can be used for audio-visual fusion, namely to map speech signals onto images and hence to spatially align the audio and visual modalities, thus enabling discrimination between speaking and non-speaking faces. We release a novel corpus of real-room recordings that allows quantitative evaluation of the co-localization method in the presence of one or two sound sources. Experiments demonstrate increased accuracy and speed relative to several state-of-the-art methods.
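    The training-then-localization pipeline can be illustrated with a heavily simplified locally-weighted linear regression. The sketch assumes a single scalar binaural feature (an interaural level difference, in dB) mapped to azimuth in degrees; the actual method uses high-dimensional auditory features and a probabilistic piecewise-linear model, and the training pairs below are invented.

```python
import numpy as np

# Locally-linear regression sketch (assumption: scalar interaural level
# difference (ILD) feature and invented training pairs; the paper learns
# a probabilistic mapping from high-dimensional binaural features).

train_ild = np.array([-9.0, -6.0, -3.0, 0.0, 3.0, 6.0, 9.0])  # dB
train_az  = np.array([-60., -40., -20.,  0., 20., 40., 60.])  # degrees

def localize(ild, bandwidth=3.0):
    # Gaussian weights concentrate the fit on nearby training features,
    # making the regression locally rather than globally linear.
    w = np.exp(-0.5 * ((train_ild - ild) / bandwidth) ** 2)
    # Weighted least-squares line az = c0 + c1 * ild around the query.
    A = np.vstack([np.ones_like(train_ild), train_ild]).T
    W = np.diag(w)
    c = np.linalg.solve(A.T @ W @ A, A.T @ W @ train_az)
    return float(c[0] + c[1] * ild)
```

    Training on dense wide-spectrum data, as in the paper, is what lets such a local model generalize to sparse-spectrum test sounds: any test feature lands near well-covered training features.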

    Advanced imaging and data mining technologies for medical and food safety applications

    As one of the fastest-developing research areas, biological imaging and image analysis have received increasing attention and are already widely applied in many scientific fields, including medical diagnosis and food safety inspection. This research focuses on advanced imaging and pattern recognition technologies for both medical and food safety applications: 1) noise reduction of ultra-low-dose multi-slice helical CT imaging for early lung cancer screening, and 2) automated discrimination between walnut shell and meat under hyperspectral fluorescence imaging. In the medical imaging and diagnosis area, because X-ray computed tomography (CT) has been applied to screen large populations for early lung cancer detection during the last decade, increasing attention has been paid to low-dose and even ultra-low-dose X-ray CT. However, reducing CT radiation exposure inevitably increases the noise level in the sinogram, thereby degrading the quality of reconstructed CT images. Reducing the noise level in low-dose CT images is therefore an important problem. In this research, a nonparametric smoothing method with block-based thin-plate smoothing splines and a roughness penalty was introduced to restore the ultra-low-dose helical CT raw data, which were acquired under a 120 kVp / 10 mAs protocol. An objective thorax image-quality evaluation was first conducted to assess the image quality and noise level of the proposed method. A web-based subjective evaluation system was also built for a total of 23 radiologists to compare the proposed approach with a traditional sinogram restoration method. Both objective and subjective evaluation studies showed the effectiveness of the proposed thin-plate-based nonparametric regression method in sinogram restoration of multi-slice helical ultra-low-dose CT.
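    The roughness-penalized smoothing idea can be sketched in one dimension: minimize squared error to the noisy data plus a penalty on second differences, which reduces to a linear system. This is only a 1-D analogue on invented data; the research above uses block-based thin-plate smoothing splines on 2-D sinograms.

```python
import numpy as np

# 1-D roughness-penalized smoothing sketch (assumption: a simplified
# analogue of thin-plate spline smoothing; the signal below is synthetic).
# Solve argmin_f ||y - f||^2 + lam * ||D2 f||^2, where D2 is the
# second-difference operator, via the normal equations.

n = 100
x = np.linspace(0.0, 1.0, n)
rng = np.random.default_rng(0)
clean = np.sin(2 * np.pi * x)
y = clean + 0.3 * rng.standard_normal(n)   # noisy observations

# Second-difference matrix D2 of shape (n-2, n): rows are [1, -2, 1].
D2 = np.diff(np.eye(n), n=2, axis=0)
lam = 10.0
f = np.linalg.solve(np.eye(n) + lam * D2.T @ D2, y)

# The smoothed estimate should be closer to the clean signal than the
# raw noisy data is.
err_noisy = np.sqrt(np.mean((y - clean) ** 2))
err_smooth = np.sqrt(np.mean((f - clean) ** 2))
```

    The penalty suppresses high-frequency components strongly while leaving the slowly varying signal nearly untouched, which is the mechanism that lets sinogram restoration remove quantum noise without destroying anatomy.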
In the food quality inspection area, automated discrimination between walnut shell and meat has become an imperative task in the walnut postharvest processing industry in the U.S. This research developed two hyperspectral fluorescence imaging-based approaches capable of differentiating small walnut shell fragments from meat. First, a principal component analysis (PCA) and Gaussian mixture model (PCA-GMM)-based Bayesian classification method was introduced. PCA was used to extract features, and the optimal number of PCA components was selected by a cross-validation technique. The PCA-GMM-based Bayesian classifier was then applied to differentiate walnut shell from meat according to the class-conditional probability and the prior estimated by the Gaussian mixture model. The experimental results showed the effectiveness of this PCA-GMM approach, with an overall recognition rate of 98.2%. Second, a Gaussian-kernel Support Vector Machine (SVM) was presented for walnut shell and meat discrimination in the hyperspectral fluorescence imagery. The SVM seeks a mapping to a high-dimensional feature space in which the nonlinearly separable input data become separable, thereby enabling classification between walnut shell and meat. An overall recognition rate of 98.7% was achieved by this method. Although hyperspectral fluorescence imaging is capable of differentiating between walnut shell and meat, one persistent problem is how to deal with the huge amount of data acquired by the hyperspectral imaging system and thereby improve the efficiency of the application system. To solve this problem, an Independent Component Analysis with k-Nearest Neighbor classifier (ICA-kNN) approach was presented in this research to reduce data redundancy without significantly sacrificing classification performance.
An overall 90.6% detection rate was achieved with 10 optimal wavelengths, which constituted only 13% of the total acquired hyperspectral image data. To further evaluate the proposed method, the classification results of the ICA-kNN approach were also compared to those of the kNN classifier alone. The experimental results showed that the ICA-kNN method with fewer wavelengths matched the performance of the kNN classifier alone using information from all 79 wavelengths. This demonstrates the effectiveness of the proposed ICA-kNN method for hyperspectral band selection in walnut shell and meat classification.
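    A compact sketch of the PCA-plus-Bayesian-classification pipeline is shown below on synthetic two-class "spectra". It uses a single Gaussian per class (a one-component mixture) and a fixed number of principal components; the actual system fits Gaussian mixture models to real hyperspectral fluorescence data and selects the PCA order by cross-validation.

```python
import numpy as np

# PCA + Bayesian classification sketch (assumption: synthetic 5-band
# "shell"/"meat" spectra with opposite spectral slopes; the real system
# uses GMMs on hyperspectral fluorescence images).

rng = np.random.default_rng(0)
shell = rng.normal(0.0, 0.3, size=(50, 5)) + np.linspace(0, 1, 5)
meat  = rng.normal(0.0, 0.3, size=(50, 5)) + np.linspace(1, 0, 5)
X = np.vstack([shell, meat])
y = np.array([0] * 50 + [1] * 50)

# PCA: project onto the top-2 principal components of the pooled data.
Xc = X - X.mean(axis=0)
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
Z = Xc @ Vt[:2].T

# Class-conditional Gaussians; priors are equal here (50/50 split),
# so they drop out of the decision rule.
params = [(Z[y == c].mean(axis=0), np.cov(Z[y == c].T)) for c in (0, 1)]

def log_gauss(z, mu, cov):
    d = z - mu
    return -0.5 * (d @ np.linalg.solve(cov, d) + np.log(np.linalg.det(cov)))

def classify(z):
    return int(np.argmax([log_gauss(z, mu, cov) for mu, cov in params]))

pred = np.array([classify(z) for z in Z])
accuracy = (pred == y).mean()
```

    Projecting to a few principal components before fitting the class densities is what keeps the density estimates well-conditioned when the raw spectra have many correlated bands.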

    Model-Based Speech Enhancement

    A method of speech enhancement is developed that reconstructs clean speech from a set of acoustic features using a harmonic plus noise model of speech. This is a significant departure from traditional filtering-based methods of speech enhancement. A major challenge with this approach is to estimate accurately the acoustic features (voicing, fundamental frequency, spectral envelope and phase) from noisy speech. This is achieved using maximum a posteriori (MAP) estimation methods that operate on the noisy speech. In each case, a prior model of the relationship between the noisy speech features and the estimated acoustic feature is required. These models are approximated using speaker-independent GMMs of the clean speech features, which are adapted to speaker-dependent models using MAP adaptation and adapted for noise using the unscented transform. Objective results are presented to optimise the proposed system, and a set of subjective tests compares the approach with traditional enhancement methods. Three-way listening tests examining signal quality, background noise intrusiveness and overall quality show the proposed system to be highly robust to noise, performing significantly better than conventional methods of enhancement in terms of background noise intrusiveness. However, the proposed method is shown to reduce signal quality, with overall quality measured to be roughly equivalent to that of the Wiener filter.
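    The resynthesis side of the harmonic plus noise model can be sketched for one frame: sinusoids at multiples of the fundamental, weighted by the spectral envelope, plus a stochastic residual. The fundamental, envelope and noise gain below are invented placeholders; in the actual system they would be MAP estimates from noisy speech.

```python
import numpy as np

# Harmonic-plus-noise synthesis sketch for one 20 ms frame (assumption:
# made-up f0, spectral envelope and noise gain; the paper estimates
# these features from noisy speech before resynthesis).

fs = 16000.0
f0 = 200.0                            # fundamental frequency, Hz
t = np.arange(int(0.02 * fs)) / fs    # 320 samples at 16 kHz

def hnm_frame(f0, envelope, noise_gain, t, fs):
    # Harmonic part: sinusoids at multiples of f0, each weighted by the
    # spectral envelope sampled at its harmonic frequency.
    n_harm = int((fs / 2) // f0)
    harm = sum(envelope(k * f0) * np.sin(2 * np.pi * k * f0 * t)
               for k in range(1, n_harm + 1))
    # Noise part: white noise standing in for the stochastic residual.
    rng = np.random.default_rng(0)
    return harm + noise_gain * rng.standard_normal(t.size)

envelope = lambda f: np.exp(-f / 2000.0)   # decaying toy envelope
frame = hnm_frame(f0, envelope, noise_gain=0.01, t=t, fs=fs)
```

    Because the output is built from the estimated features rather than by filtering the noisy waveform, noise robustness hinges entirely on feature estimation, which matches the trade-off reported above: low background intrusiveness but some loss of signal quality.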