
    A multi-camera approach to image-based rendering and 3-D/Multiview display of ancient chinese artifacts

    Get PDF

    Backward Compatible Spatialized Teleconferencing based on Squeezed Recordings

    Get PDF
    Commercial teleconferencing systems currently available, although offering sophisticated video stimulus of the remote participants, commonly employ only mono or stereo audio playback for the user. However, in teleconferencing applications where there are multiple participants at multiple sites, spatializing the audio reproduced at each site (using headphones or loudspeakers) to assist listeners in distinguishing between participating speakers can significantly improve the meeting experience (Baldis, 2001; Evans et al., 2000; Ward & Elko, 1999; Kilgore et al., 2003; Wrigley et al., 2009; James & Hawksford, 2008). An example is Vocal Village (Kilgore et al., 2003), which uses online avatars to co-locate remote participants over the Internet in a virtual space, with audio spatialized over headphones. This system adds speaker location cues to monaural speech to create a user-manipulable soundfield that matches each avatar's position in the virtual space. Giving participants the freedom to manipulate the acoustic location of other participants in the rendered sound scene has been shown to improve multitasking performance (Wrigley et al., 2009). A system for multiparty teleconferencing first requires a stage for recording speech from multiple participants at each site. These signals then need to be compressed to allow for efficient transmission of the spatial speech. One approach is to use close-talking microphones to record each participant (e.g. lapel microphones) and then encode each speech signal separately prior to transmission (James & Hawksford, 2008). Alternatively, for increased flexibility, a microphone array located at a central point on, say, a meeting table can be used to generate a multichannel recording of the meeting speech. A microphone array approach is adopted in this work; it allows the recordings to be processed to identify the relative spatial locations of the sources, and it permits multichannel speech enhancement techniques to improve the quality of recordings in noisy environments. For efficient transmission of the recorded signals, the approach also requires a multichannel compression technique suited to spatially recorded speech signals.
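
    As a rough illustration of the "speaker location cues" idea mentioned above, the following Python sketch applies simple interaural time and level differences (ITD/ILD) to a mono speech signal to place it at a chosen azimuth. It is a toy binaural panner only; the function name, the Woodworth-style ITD approximation and the ~6 dB level-difference ceiling are illustrative assumptions, not the spatializer or codec used in the cited systems.

        # Toy binaural panning: add interaural time and level differences (ITD/ILD)
        # to a mono signal to place it at a given azimuth. Illustrative sketch only,
        # not the spatialization or compression scheme described in this work.
        import numpy as np

        def spatialize_mono(mono, fs, azimuth_deg, head_radius=0.0875, c=343.0):
            """Return (left, right) signals with simple ITD/ILD cues applied."""
            az = np.deg2rad(azimuth_deg)                 # positive azimuth = source to the right
            itd = head_radius / c * (az + np.sin(az))    # Woodworth-style ITD approximation
            delay = int(round(abs(itd) * fs))            # ITD applied as a whole-sample delay
            ild = 10 ** (-6.0 * abs(np.sin(az)) / 20.0)  # crude level drop (max ~6 dB) at the far ear

            near = mono
            far = np.concatenate([np.zeros(delay), mono])[: len(mono)] * ild
            return (far, near) if az > 0 else (near, far)

        if __name__ == "__main__":
            fs = 16000
            t = np.arange(fs) / fs
            speech_like = np.sin(2 * np.pi * 220 * t) * np.hanning(fs)
            left, right = spatialize_mono(speech_like, fs, azimuth_deg=45)
            print(left.shape, right.shape)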

    Visualization techniques to aid in the analysis of multi-spectral astrophysical data sets

    Get PDF
    This report describes our project activities for the period Sep. 1991 - Oct. 1992. Our activities included stabilizing the software system STAR, porting STAR to IDL/widgets (improved user interface), targeting new visualization techniques for multi-dimensional data visualization (emphasizing 3D visualization), and exploring leading-edge 3D interface devices. During the past project year we emphasized high-end visualization techniques: by exploring new tools offered by state-of-the-art visualization software (such as AVS and IDL/widgets), by experimenting with tools still under research at the Department of Computer Science (e.g., the use of glyphs for multidimensional data visualization), and by researching current 3D input/output devices as they could be used to explore 3D astrophysical data. As always, all project activity is driven by the need to interpret astrophysical data more effectively.

    PEA265: Perceptual Assessment of Video Compression Artifacts

    Full text link
    The most widely used video encoders share a common hybrid coding framework that includes block-based motion estimation/compensation and block-based transform coding. Despite their high coding efficiency, the encoded videos often exhibit visually annoying artifacts, denoted as Perceivable Encoding Artifacts (PEAs), which significantly degrade the visual Quality-of-Experience (QoE) of end users. To monitor and improve visual QoE, it is crucial to develop subjective and objective measures that can identify and quantify various types of PEAs. In this work, we make the first attempt to build a large-scale subject-labelled database composed of H.265/HEVC compressed videos containing various PEAs. The database, namely the PEA265 database, includes 4 types of spatial PEAs (i.e. blurring, blocking, ringing and color bleeding) and 2 types of temporal PEAs (i.e. flickering and floating), each containing at least 60,000 image or video patches with positive and negative labels. To objectively identify these PEAs, we train Convolutional Neural Networks (CNNs) using the PEA265 database. It appears that the state-of-the-art ResNeXt is capable of identifying each type of PEA with high accuracy. Furthermore, we define PEA pattern and PEA intensity measures to quantify the PEA levels of compressed video sequences. We believe that the PEA265 database and our findings will benefit the future development of video quality assessment methods and perceptually motivated video encoders. Comment: 10 pages, 15 figures, 4 tables
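
    The following PyTorch sketch shows the kind of patch classifier the abstract describes: a pretrained ResNeXt with its final layer replaced for a binary positive/negative decision on one PEA type. The 64x64 patch size, optimizer settings and the dummy batch are assumptions for illustration, not the authors' exact training configuration.

        # Minimal sketch: fine-tune a pretrained ResNeXt as a binary detector for
        # one PEA type (e.g. blocking). Hyperparameters and patch size are assumed.
        import torch
        import torch.nn as nn
        from torchvision import models

        def build_pea_detector(num_classes: int = 2) -> nn.Module:
            model = models.resnext50_32x4d(weights="IMAGENET1K_V1")
            model.fc = nn.Linear(model.fc.in_features, num_classes)  # positive / negative patch
            return model

        model = build_pea_detector()
        criterion = nn.CrossEntropyLoss()
        optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

        # One training step on a dummy batch of RGB patches (label 1 = PEA present).
        patches = torch.randn(8, 3, 64, 64)
        labels = torch.randint(0, 2, (8,))
        optimizer.zero_grad()
        loss = criterion(model(patches), labels)
        loss.backward()
        optimizer.step()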

    Anahita: A System for 3D Video Streaming with Depth Customization

    Get PDF
    Producing high-quality stereoscopic 3D content requires significantly more effort than preparing regular video footage. In order to assure good depth perception and visual comfort, 3D videos need to be carefully adjusted to specific viewing conditions before they are shown to viewers. While most stereoscopic 3D content is designed for viewing in movie theaters, where viewing conditions do not vary significantly, adapting the same content for viewing on home TV sets, desktop displays, laptops, and mobile devices requires additional adjustments. To address this challenge, we propose a new system for 3D video streaming that provides automatic depth adjustments as one of its key features. Our system takes into account both the content and the display type in order to customize 3D videos and maximize their perceived quality. We propose a novel method for depth adjustment that is well-suited for videos of field sports such as soccer, football, and tennis. Our method is computationally efficient and it does not introduce any visual artifacts. We have implemented our 3D streaming system and conducted two user studies, which show: (i) adapting stereoscopic 3D videos for different displays is beneficial, and (ii) our proposed system can achieve up to 35% improvement in the perceived quality of the stereoscopic 3D content.
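
    As a concrete (and deliberately simplified) illustration of display-dependent depth adjustment, the sketch below linearly remaps a video's measured disparity range onto a target display's assumed "comfort zone" in pixels. The comfort-zone numbers are placeholders; Anahita's actual depth-customization method is content- and display-aware and considerably more elaborate.

        # Linear disparity retargeting: map disparities measured in the source
        # master onto an assumed comfortable range for the target display.
        def retarget_disparity(d, src_range=(-30.0, 60.0), dst_range=(-10.0, 20.0)):
            """Linearly map a disparity d (in pixels) from the source range to the
            target display's assumed comfortable range."""
            s_lo, s_hi = src_range
            t_lo, t_hi = dst_range
            return t_lo + (d - s_lo) * (t_hi - t_lo) / (s_hi - s_lo)

        # A pixel with 45 px of disparity in the source master is remapped for a
        # display whose comfort zone is assumed to be [-10, 20] px:
        print(retarget_disparity(45.0))   # -> 15.0 px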

    Predicting pedestrian crossing intentions using contextual information

    Get PDF
    The urban environment is one of the most complex scenarios for an autonomous vehicle, as it is shared with other types of users known as vulnerable road users, with pedestrians as their principal representative. These users are characterized by their great dynamicity. Despite a large number of interactions between vehicles and pedestrians, the safety of pedestrians has not increased at the same rate as that of vehicle occupants. For this reason, it is necessary to address this problem. One possible strategy would be anticipating pedestrian behavior to minimize risky situations, especially during the crossing. The objective of this doctoral thesis is to achieve such anticipation through the development of crosswalk action prediction techniques based on deep learning. Before the design and implementation of the prediction systems, a classification system has been developed to discern the pedestrians involved in the road scene. The system, based on convolutional neural networks, has been trained and validated with a customized dataset. This set has been built from several existing sets and augmented by including images obtained from the Internet. This pre-anticipation step would reduce unnecessary processing within the vehicle perception system. After this step, two systems have been developed as a proposal to solve the prediction problem. The first system is composed of convolutional and recurrent encoder networks. It obtains a short-term prediction of the crossing action performed one second in the future. The input information to the model is mainly image-based, which provides additional pedestrian context. In addition, the use of pedestrian-related variables and architectural improvements allows better results on the JAAD dataset. The second system is an end-to-end architecture based on the combination of three-dimensional convolutional neural networks and/or the Transformer architecture encoder. In this model, most of the proposed and investigated improvements are focused on transformations of the input data. After an extensive set of individual tests, several models have been trained, evaluated, and compared with other methods using both the JAAD and PIE datasets. The obtained results rank among the best in the state of the art, validating the proposed architecture.
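
    The sketch below illustrates the general shape of the first system described above: per-frame CNN features feeding a recurrent encoder that outputs the probability of a crossing action about one second ahead. The ResNet-18 backbone, feature sizes and the omission of the extra pedestrian-related variables are illustrative assumptions, not the thesis architecture.

        # Minimal convolutional + recurrent crossing predictor (illustrative only).
        import torch
        import torch.nn as nn
        from torchvision import models

        class CrossingPredictor(nn.Module):
            def __init__(self, feat_dim=512, hidden=256):
                super().__init__()
                backbone = models.resnet18(weights="IMAGENET1K_V1")
                self.cnn = nn.Sequential(*list(backbone.children())[:-1])  # global-pooled features
                self.gru = nn.GRU(feat_dim, hidden, batch_first=True)
                self.head = nn.Linear(hidden, 1)                           # crossing / not crossing

            def forward(self, clips):                                      # clips: (batch, time, 3, H, W)
                b, t = clips.shape[:2]
                feats = self.cnn(clips.flatten(0, 1)).flatten(1)           # (b*t, feat_dim)
                _, h = self.gru(feats.view(b, t, -1))
                return torch.sigmoid(self.head(h[-1]))                     # P(crossing in ~1 s)

        model = CrossingPredictor()
        print(model(torch.randn(2, 16, 3, 224, 224)).shape)                # torch.Size([2, 1])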

    LAVSS: Location-Guided Audio-Visual Spatial Audio Separation

    Full text link
    Existing machine learning research has achieved promising results in monaural audio-visual separation (MAVS). However, most MAVS methods purely consider what the sound source is, not where it is located. This can be a problem in VR/AR scenarios, where listeners need to be able to distinguish between similar audio sources located in different directions. To address this limitation, we have generalized MAVS to spatial audio separation and proposed LAVSS: a location-guided audio-visual spatial audio separator. LAVSS is inspired by the correlation between spatial audio and visual location. We introduce the phase difference carried by binaural audio as spatial cues, and we utilize positional representations of sounding objects as additional modality guidance. We also leverage multi-level cross-modal attention to perform visual-positional collaboration with audio features. In addition, we adopt a pre-trained monaural separator to transfer knowledge from rich mono sounds to boost spatial audio separation. This exploits the correlation between monaural and binaural channels. Experiments on the FAIR-Play dataset demonstrate the superiority of the proposed LAVSS over existing benchmarks of audio-visual separation. Our project page: https://yyx666660.github.io/LAVSS/. Comment: Accepted by WACV202
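
    To make the "phase difference carried by binaural audio" concrete, the sketch below computes interaural phase difference (IPD) features from the left/right STFTs of a binaural clip, encoded as cos/sin maps. The STFT parameters are assumptions; LAVSS combines such cues with visual and positional representations inside a larger separation network.

        # Interaural phase difference (IPD) features from a binaural waveform.
        import torch

        def ipd_features(left, right, n_fft=512, hop=160):
            """left, right: (batch, samples) waveforms -> (batch, 2, freq, frames)."""
            window = torch.hann_window(n_fft)
            L = torch.stft(left, n_fft, hop, window=window, return_complex=True)
            R = torch.stft(right, n_fft, hop, window=window, return_complex=True)
            ipd = torch.angle(L) - torch.angle(R)
            return torch.stack([torch.cos(ipd), torch.sin(ipd)], dim=1)

        binaural = torch.randn(2, 2, 16000)                        # batch of 1 s binaural clips
        print(ipd_features(binaural[:, 0], binaural[:, 1]).shape)  # torch.Size([2, 2, 257, 101])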

    Automated Visual Database Creation For A Ground Vehicle Simulator

    Get PDF
    This research focuses on extracting road models from stereo video sequences taken from a moving vehicle. The proposed method combines color histogram based segmentation, active contours (snakes) and morphological processing to extract road boundary coordinates for conversion into Matlab or Multigen OpenFlight compatible polygonal representations. Color segmentation uses an initial truth frame to develop a color probability density function (PDF) of the road versus the terrain. Subsequent frames are segmented using a Maximum A Posteriori (MAP) criterion and the resulting templates are used to update the PDFs. Color segmentation worked well where there was minimal shadowing and occlusion by other cars. A snake algorithm was used to find the road edges, which were converted to 3D coordinates using stereo disparity and vehicle position information. The resulting 3D road models were accurate to within 1 meter.
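
    A small sketch of the MAP colour-classification step described above: per-pixel road and terrain likelihoods are looked up in colour histograms (the PDFs built from a truth frame), and each pixel is assigned the class with the larger posterior. The histogram binning and the uniform prior are illustrative choices, not the values used in this research.

        # MAP road/terrain pixel classification from colour-histogram PDFs.
        import numpy as np

        def colour_pdf(pixels, bins=16):
            """pixels: (N, 3) uint8 RGB samples -> normalised 3-D colour histogram."""
            hist, _ = np.histogramdd(pixels, bins=(bins,) * 3, range=[(0, 256)] * 3)
            return hist / hist.sum()

        def map_segment(image, road_pdf, terrain_pdf, prior_road=0.5, bins=16):
            """image: (H, W, 3) uint8 -> boolean road mask from a MAP decision."""
            idx = (image // (256 // bins)).reshape(-1, 3)
            p_road = road_pdf[idx[:, 0], idx[:, 1], idx[:, 2]] * prior_road
            p_terrain = terrain_pdf[idx[:, 0], idx[:, 1], idx[:, 2]] * (1.0 - prior_road)
            return (p_road > p_terrain).reshape(image.shape[:2])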

    Predictive World Models from Real-World Partial Observations

    Full text link
    Cognitive scientists believe adaptable intelligent agents like humans perform reasoning through learned causal mental simulations of agents and environments. The problem of learning such simulations is called predictive world modeling. Recently, reinforcement learning (RL) agents leveraging world models have achieved SOTA performance in game environments. However, understanding how to apply the world modeling approach in complex real-world environments relevant to mobile robots remains an open question. In this paper, we present a framework for learning a probabilistic predictive world model for real-world road environments. We implement the model using a hierarchical VAE (HVAE) capable of predicting a diverse set of fully observed plausible worlds from accumulated sensor observations. While prior HVAE methods require complete states as ground truth for learning, we present a novel sequential training method to allow HVAEs to learn to predict complete states from partially observed states only. We experimentally demonstrate accurate spatial structure prediction of deterministic regions achieving 96.21 IoU, and close the gap to perfect prediction by 62% for stochastic regions using the best prediction. By extending HVAEs to cases where complete ground truth states do not exist, we facilitate continual learning of spatial prediction as a step towards realizing explainable and comprehensive predictive world models for real-world mobile robotics applications. Code is available at https://github.com/robin-karlsson0/predictive-world-models. Comment: Accepted for IEEE MOST 202
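
    As one plausible ingredient of learning from partially observed states, the sketch below evaluates a reconstruction loss only over grid cells the sensors actually observed, so unobserved regions never act as (non-existent) ground truth. This is only an illustration of the masking idea; the paper's HVAE and its sequential training procedure are considerably more involved.

        # Reconstruction loss restricted to observed cells of a partial observation.
        import torch
        import torch.nn.functional as F

        def masked_recon_loss(pred, partial_obs, observed_mask):
            """pred, partial_obs: (B, C, H, W) occupancy grids; observed_mask in {0, 1}."""
            per_cell = F.binary_cross_entropy(pred, partial_obs, reduction="none")
            return (per_cell * observed_mask).sum() / observed_mask.sum().clamp(min=1)

        pred = torch.sigmoid(torch.randn(2, 1, 64, 64))    # predicted occupancy probabilities
        obs = torch.randint(0, 2, (2, 1, 64, 64)).float()  # partially observed ground truth
        mask = torch.randint(0, 2, (2, 1, 64, 64)).float() # 1 where the sensors observed a cell
        print(masked_recon_loss(pred, obs, mask).item())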