    Audio-visual object localization and separation using low-rank and sparsity

    The ability to localize visual objects that are associated with an audio source and at the same time separate the audio signal is a cornerstone of several audio-visual signal processing applications. Past efforts usually focused on localizing only the visual objects, without audio separation abilities. Besides, they often rely on computationally expensive pre-processing steps to segment image pixels into object regions before applying localization approaches. We aim to address the problem of audio-visual source localization and separation in an unsupervised manner. The proposed approach employs low-rankness to model the background visual and audio information and sparsity to extract the sparsely correlated components between the audio and visual modalities. In particular, this model decomposes each dataset into a sum of two terms: the low-rank matrices capture the background uncorrelated information, while the sparse correlated components model the sound source in the visual modality and the associated sound in the audio modality. To this end, a novel optimization problem, involving the minimization of nuclear norms and matrix ℓ1-norms, is solved. We evaluated the proposed method on 1) visual localization and audio separation and 2) visual-assisted audio denoising. The experimental results demonstrate the effectiveness of the proposed method.
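
    As a rough illustration of the low-rank-plus-sparse idea above, the sketch below alternates singular value thresholding (nuclear norm) with entrywise soft thresholding (ℓ1 norm) on a single data matrix. It is a generic relaxed formulation, not the paper's coupled audio-visual model; the step size, default λ, and iteration count are assumptions.

        # Block coordinate descent on the relaxed objective
        #   ||L||_* + lam*||S||_1 + 0.5*||D - L - S||_F^2
        # This is a generic illustration, not the paper's exact coupled
        # audio-visual formulation; parameters below are assumed defaults.
        import numpy as np

        def svt(M, tau):
            """Singular value thresholding: proximal operator of the nuclear norm."""
            U, s, Vt = np.linalg.svd(M, full_matrices=False)
            return U @ np.diag(np.maximum(s - tau, 0.0)) @ Vt

        def soft(M, tau):
            """Entrywise soft thresholding: proximal operator of the l1 norm."""
            return np.sign(M) * np.maximum(np.abs(M) - tau, 0.0)

        def low_rank_plus_sparse(D, lam=None, n_iter=100):
            """Split D into a low-rank background L and a sparse component S."""
            if lam is None:
                lam = 1.0 / np.sqrt(max(D.shape))
            L = np.zeros_like(D)
            S = np.zeros_like(D)
            for _ in range(n_iter):
                L = svt(D - S, 1.0)   # update low-rank background
                S = soft(D - L, lam)  # update sparse (correlated) part
            return L, S

        # Example: decompose a random "feature" matrix (frames x features).
        D = np.random.randn(64, 32)
        L, S = low_rank_plus_sparse(D)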

    Adhesion, friction, and wear of plasma-deposited thin silicon nitride films at temperatures to 700 C

    The adhesion, friction, and wear behavior of silicon nitride films deposited by low- and high-frequency plasmas (30 kHz and 13.56 MHz) at various temperatures to 700 C in vacuum was examined. The results of the investigation indicated that the Si/N ratios were much greater for the films deposited at 13.56 MHz than for those deposited at 30 kHz. Amorphous silicon was present in both low- and high-frequency plasma-deposited silicon nitride films. However, more amorphous silicon occurred in the films deposited at 13.56 MHz than in those deposited at 30 kHz. Temperature significantly influenced adhesion, friction, and wear of the silicon nitride films. Wear occurred in the contact area at high temperature. The wear correlated with the increase in adhesion and friction for the low- and high-frequency plasma-deposited films above 600 and 500 C, respectively. The low- and high-frequency plasma-deposited thin silicon nitride films exhibited a capability for lubrication (low adhesion and friction) in vacuum at temperatures to 500 and 400 C, respectively.

    Spontaneous vs. posed facial behavior: automatic analysis of brow actions

    Past research on automatic facial expression analysis has focused mostly on the recognition of prototypic expressions of discrete emotions rather than on the analysis of dynamic changes over time, although the importance of temporal dynamics of facial expressions for interpretation of the observed facial behavior has been acknowledged for over 20 years. For instance, it has been shown that the temporal dynamics of spontaneous and volitional smiles are fundamentally different from each other. In this work, we argue that the same holds for the temporal dynamics of brow actions and show that velocity, duration, and order of occurrence of brow actions are highly relevant parameters for distinguishing posed from spontaneous brow actions. The proposed system for discrimination between volitional and spontaneous brow actions is based on automatic detection of Action Units (AUs) and their temporal segments (onset, apex, offset) produced by movements of the eyebrows. For each temporal segment of an activated AU, we compute a number of mid-level feature parameters including the maximal intensity, duration, and order of occurrence. We use GentleBoost to select the most important of these parameters. The selected parameters are used further to train Relevance Vector Machines to determine, per temporal segment of an activated AU, whether the action was displayed spontaneously or volitionally. Finally, a probabilistic decision function determines the class (spontaneous or posed) for the entire brow action. When tested on 189 samples taken from three different sets of spontaneous and volitional facial data, we attain a 90.7% correct recognition rate. Categories and Subject Descriptors: I.2.10 [Vision and Scene Understanding]: motion, modeling and recovery of physical attributes.
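
    The snippet below sketches how per-segment mid-level parameters of the kind listed above (maximal intensity, durations, onset/offset speed) could be computed from an AU intensity track and fed to a per-segment classifier. The AU detector is not reproduced, scikit-learn's LogisticRegression stands in for the GentleBoost selection and Relevance Vector Machine stages, and the feature set, frame rate and toy data are all assumptions.

        # Simplified stand-in pipeline; feature choices, frame rate and the
        # LogisticRegression classifier are assumptions, not the paper's setup.
        import numpy as np
        from sklearn.linear_model import LogisticRegression

        def segment_features(intensity, t_onset, t_apex, t_offset, fps=25.0):
            """Mid-level parameters for one activated AU: maximal intensity,
            durations, and mean onset/offset velocities."""
            onset = intensity[t_onset:t_apex + 1]
            offset = intensity[t_apex:t_offset + 1]
            return np.array([
                intensity[t_onset:t_offset + 1].max(),                      # max intensity
                (t_offset - t_onset) / fps,                                  # total duration (s)
                (t_apex - t_onset) / fps,                                    # onset duration (s)
                (onset[-1] - onset[0]) * fps / max(len(onset) - 1, 1),       # onset speed
                (offset[0] - offset[-1]) * fps / max(len(offset) - 1, 1),    # offset speed
            ])

        # Toy data: brow AU intensity tracks with (onset, apex, offset) frame indices.
        rng = np.random.default_rng(0)
        X = np.vstack([segment_features(rng.random(100).cumsum() / 100, 10, 40, 80)
                       for _ in range(40)])
        y = rng.integers(0, 2, size=40)        # 1 = posed, 0 = spontaneous (dummy labels)

        clf = LogisticRegression().fit(X, y)   # stand-in for the RVM stage
        print(clf.predict_proba(X[:3]))        # per-segment posed/spontaneous scores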

    Time-dependent reduction of structural complexity of the buccal epithelial cell nuclei after treatment with silver nanoparticles

    Recent studies have suggested that silver nanoparticles (AgNPs) may affect cell DNA structure in in vitro conditions. In this paper, we present results indicating that AgNPs change nuclear complexity properties in isolated human buccal epithelial cells in a time-dependent manner. Buccal epithelial cells were plated in special tissue culture chamber slides and kept at 37°C in an RPMI 1640 cell culture medium supplemented with L-glutamine. The cells were treated with colloidal silver nanoparticles suspended in RPMI 1640 medium at a concentration of 15 mg L−1. Digital micrographs of the cell nuclei in a sample of 30 cells were created at five different time steps: before the treatment (controls), immediately after the treatment, as well as 15, 30, and 60 min after the treatment with AgNPs. For each nuclear structure, values of fractal dimension, lacunarity, circularity, as well as parameters of grey level co-occurrence matrix (GLCM) texture, were determined. The results indicate a time-dependent reduction of structural complexity in the cell nuclei after contact with AgNPs. These findings further suggest that AgNPs, at concentrations present in today's over-the-counter drug products, might have significant effects on the cell genetic material.
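
    For the GLCM texture parameters mentioned above, a minimal sketch using scikit-image is given below (graycomatrix/graycoprops, scikit-image >= 0.19 naming). The distances, angles and synthetic test patch are assumptions, and the fractal dimension and lacunarity measurements are not shown.

        # GLCM texture sketch; parameter choices and the synthetic patch are
        # assumptions, not the study's imaging protocol.
        import numpy as np
        from skimage.feature import graycomatrix, graycoprops

        def glcm_texture(nucleus_gray):
            """Angular second moment, contrast and correlation of a grayscale
            (uint8) nucleus image, averaged over four directions."""
            glcm = graycomatrix(nucleus_gray, distances=[1],
                                angles=[0, np.pi/4, np.pi/2, 3*np.pi/4],
                                levels=256, symmetric=True, normed=True)
            return {prop: graycoprops(glcm, prop).mean()
                    for prop in ("ASM", "contrast", "correlation")}

        # Example on a synthetic 64x64 "nucleus" patch.
        img = (np.random.default_rng(1).random((64, 64)) * 255).astype(np.uint8)
        print(glcm_texture(img))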

    Electron beam induced damage in PECVD Si3N4 and SiO2 films on InP

    Phosphorus-rich silicon nitride and silicon dioxide films deposited on n-type indium phosphide (InP) substrates by plasma enhanced chemical vapor deposition (PECVD) were exposed to electron beam irradiation in the 5 to 40 keV range for the purpose of characterizing the damage induced in the dielectric. The electron beam exposure was in the range of 10^-7 to 10^-3 C/sq cm. The damage to the devices was characterized by capacitance-voltage (C-V) measurements of the metal-insulator-semiconductor (MIS) capacitors. These results were compared to results obtained for radiation damage of thermal silicon dioxide on silicon (Si) MOS capacitors with similar exposures. The radiation-induced damage in the PECVD silicon nitride films on InP was successfully annealed out in a hydrogen/nitrogen (H2/N2) ambient at 400 C for 15 min. The PECVD silicon dioxide films on InP had the least radiation damage, while the thermal silicon dioxide films on Si had the most radiation damage.

    Disentangling geometry and appearance with regularised geometry-aware generative adversarial networks

    Deep generative models have significantly advanced image generation, enabling the generation of visually pleasing images with realistic texture. Apart from the texture, it is the shape geometry of objects that strongly dictates their appearance. However, currently available generative models do not incorporate geometric information into the image generation process. This often yields visual objects of degenerated quality. In this work, we propose a regularized Geometry-Aware Generative Adversarial Network (GAGAN) which disentangles appearance and shape in the latent space. This regularized GAGAN enables the generation of images with both realistic texture and shape. Specifically, we condition the generator on a statistical shape prior. The prior is enforced through mapping the generated images onto a canonical coordinate frame using a differentiable geometric transformation. In addition to incorporating geometric information, this constrains the search space and increases the model’s robustness. We show that our approach is versatile, able to generalise across domains (faces, sketches, hands and cats) and sample sizes (from as little as ∼200-30,000 to more than 200,000). We demonstrate superior performance through extensive quantitative and qualitative experiments in a variety of tasks and settings. Finally, we leverage our model to automatically and accurately detect errors or drifting in facial landmark detection and tracking in-the-wild.
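
    A much-simplified sketch of the geometry-aware conditioning described above is shown below: a toy generator takes a shape code alongside the noise vector, and its output is mapped to a canonical frame with a differentiable warp. An affine transform via torch.nn.functional.affine_grid stands in for the paper's shape-driven mapping, and the network sizes and the name GeoGenerator are illustrative assumptions.

        # Simplified geometry-aware generator; an affine warp stands in for the
        # paper's shape-driven transformation, and all sizes/names are assumptions.
        import torch
        import torch.nn as nn
        import torch.nn.functional as F

        class GeoGenerator(nn.Module):
            def __init__(self, z_dim=64, shape_dim=12, img=32):
                super().__init__()
                self.img = img
                self.net = nn.Sequential(                      # noise + shape code -> image
                    nn.Linear(z_dim + shape_dim, 256), nn.ReLU(),
                    nn.Linear(256, img * img), nn.Tanh())

            def forward(self, z, shape_code, theta):
                x = self.net(torch.cat([z, shape_code], dim=1))
                x = x.view(-1, 1, self.img, self.img)
                # Differentiable mapping onto the canonical coordinate frame.
                grid = F.affine_grid(theta, x.shape, align_corners=False)
                return x, F.grid_sample(x, grid, align_corners=False)

        # One forward pass with an identity transform for each sample in the batch.
        g = GeoGenerator()
        z, s = torch.randn(4, 64), torch.randn(4, 12)
        theta = torch.eye(2, 3).unsqueeze(0).repeat(4, 1, 1)   # 4 x 2 x 3 affine params
        raw, canonical = g(z, s, theta)
        print(canonical.shape)   # torch.Size([4, 1, 32, 32])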

    Face Mask Extraction in Video Sequence

    Inspired by the recent development of deep network-based methods in semantic image segmentation, we introduce an end-to-end trainable model for face mask extraction in video sequences. Compared to landmark-based sparse face shape representation, our method can produce the segmentation masks of individual facial components, which better reflect their detailed shape variations. By integrating the Convolutional LSTM (ConvLSTM) algorithm with Fully Convolutional Networks (FCN), our new ConvLSTM-FCN model works on a per-sequence basis and takes advantage of the temporal correlation in video clips. In addition, we propose a novel loss function, called Segmentation Loss, to directly optimise the Intersection over Union (IoU) performance. In practice, to further increase segmentation accuracy, one primary model and two additional models were trained to focus on the face, eyes, and mouth regions, respectively. Our experiments show that the proposed method achieves a 16.99% relative improvement (from 54.50% to 63.76% mean IoU) over the baseline FCN model on the 300 Videos in the Wild (300VW) dataset.
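
    As an illustration of a loss that directly targets IoU, the sketch below implements a generic soft-IoU objective in PyTorch; it is a common formulation and not necessarily the exact Segmentation Loss defined in the paper.

        # Generic differentiable soft-IoU loss; not necessarily the paper's
        # exact Segmentation Loss definition.
        import torch

        def soft_iou_loss(logits, target, eps=1e-6):
            """1 - soft IoU between predicted probabilities and a binary mask.
            logits, target: tensors of shape (N, H, W); target in {0, 1}."""
            prob = torch.sigmoid(logits)
            inter = (prob * target).sum(dim=(1, 2))
            union = (prob + target - prob * target).sum(dim=(1, 2))
            return (1.0 - (inter + eps) / (union + eps)).mean()

        # Example: random logits against a random binary mask.
        logits = torch.randn(2, 64, 64, requires_grad=True)
        target = (torch.rand(2, 64, 64) > 0.5).float()
        loss = soft_iou_loss(logits, target)
        loss.backward()
        print(float(loss))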

    Cost-effective solution to synchronised audio-visual data capture using multiple sensors

    Applications such as surveillance and human behaviour analysis require high-bandwidth recording from multiple cameras, as well as from other sensors. In turn, sensor fusion has increased the required accuracy of synchronisation between sensors. Using commercial off-the-shelf components may compromise quality and accuracy, because it is difficult to handle the combined data rate from multiple sensors, the offset and rate discrepancies between independent hardware clocks, the absence of trigger inputs or outputs in the hardware, and the different methods for timestamping the recorded data. To achieve accurate synchronisation, we centralise the synchronisation task by recording all trigger or timestamp signals with a multi-channel audio interface. For sensors that do not have an external trigger signal, we let the computer that captures the sensor data periodically generate timestamp signals from its serial port output. These signals can also be used as a common time base to synchronise multiple asynchronous audio interfaces. Furthermore, we show that a consumer PC can currently capture 8-bit video data with 1024x1024 spatial and 59.1 Hz temporal resolution from at least 14 cameras, together with 8 channels of 24-bit audio at 96 kHz. We thus improve the quality/cost ratio of multi-sensor data capture systems.
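
    The sketch below illustrates the serial-port timestamping idea: the capture PC periodically writes a timestamped byte pattern to its serial output so the signal can be recorded alongside the other trigger channels. It assumes pyserial, an arbitrary port name, baud rate and 1 s interval; the exact encoding used by the authors is not reproduced.

        # Serial-port timestamp emitter sketch; port, baud rate, interval and
        # message framing are assumptions, not the authors' exact scheme.
        import time
        import serial  # pyserial

        def emit_timestamps(port="/dev/ttyUSB0", baud=115200, interval_s=1.0):
            with serial.Serial(port, baudrate=baud) as ser:
                counter = 0
                while True:
                    # Monotonic capture-side timestamp plus a running counter,
                    # framed by a fixed sync byte for easy detection later.
                    msg = b"\xAA" + f"{counter},{time.monotonic():.6f}\n".encode()
                    ser.write(msg)
                    ser.flush()
                    counter += 1
                    time.sleep(interval_s)

        # emit_timestamps()  # run on the machine that captures the sensor data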

    Blind audio-visual localization and separation via low-rank and sparsity

    The ability to localize visual objects that are associated with an audio source and at the same time to separate the audio signal is a cornerstone in audio-visual signal-processing applications. However, available methods mainly focus on localizing only the visual objects, without audio separation abilities. Besides that, these methods often rely on either laborious preprocessing steps to segment video frames into semantic regions, or additional supervision to guide their localization. In this paper, we aim to address the problem of visual source localization and audio separation in an unsupervised manner and avoid all preprocessing or post-processing steps. To this end, we devise a novel structured matrix decomposition method that decomposes the data matrix of each modality as a superposition of three terms: 1) a low-rank matrix capturing the background information; 2) a sparse matrix capturing the correlated components among the two modalities and, hence, uncovering the sound source in the visual modality and the associated sound in the audio modality; and 3) a third sparse matrix accounting for uncorrelated components, such as distracting objects in the visual modality and irrelevant sound in the audio modality. The generality of the proposed method is demonstrated by applying it to three applications, namely: 1) visual localization of a sound source; 2) visually assisted audio separation; and 3) active speaker detection. Experimental results indicate the effectiveness of the proposed method on these application domains.
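
    In notation of our own choosing (not the paper's), the three-term decomposition described above can be summarised, for each modality m in {audio, visual}, roughly as:

        \min_{L_m,\, S_m,\, E_m} \; \|L_m\|_{*} + \lambda \|S_m\|_{1} + \gamma \|E_m\|_{1}
        \quad \text{subject to} \quad D_m = L_m + S_m + E_m,

    where D_m is the data matrix of modality m, L_m the low-rank background, S_m the sparse component correlated across modalities, and E_m the sparse uncorrelated residual; the cross-modal coupling that ties the correlated components of the two modalities together is omitted from this summary.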

    Visual-only recognition of normal, whispered and silent speech

    Silent speech interfaces have been recently proposed as a way to enable communication when the acoustic signal is not available. This introduces the need to build visual speech recognition systems for silent and whispered speech. However, almost all the recently proposed systems have been trained on vocalised data only. This is in contrast with evidence in the literature which suggests that lip movements change depending on the speech mode. In this work, we introduce a new audiovisual database which is publicly available and contains normal, whispered and silent speech. To the best of our knowledge, this is the first study which investigates the differences between the three speech modes using the visual modality only. We show that an absolute decrease in classification rate of up to 3.7% is observed when training and testing on normal and whispered speech, respectively, and vice versa. An even higher decrease of up to 8.5% is reported when the models are tested on silent speech. This reveals that there are indeed visual differences between the three speech modes, and that the common assumption that vocalised training data can be used directly to train a silent speech recognition system may not be true.