216 research outputs found
Object-Centric Slot Diffusion
The recent success of transformer-based image generative models in
object-centric learning highlights the importance of powerful image generators
for handling complex scenes. However, despite the high expressiveness of
diffusion models in image generation, their integration into object-centric
learning remains largely unexplored in this domain. In this paper, we explore
the feasibility and potential of integrating diffusion models into
object-centric learning and investigate the pros and cons of this approach. We
introduce Latent Slot Diffusion (LSD), a novel model that serves dual purposes:
it is the first object-centric learning model to replace conventional slot
decoders with a latent diffusion model conditioned on object slots, and it is
also the first unsupervised compositional conditional diffusion model that
operates without the need for supervised annotations like text. Through
experiments on various object-centric tasks, including the first application of
the FFHQ dataset in this field, we demonstrate that LSD significantly
outperforms state-of-the-art transformer-based decoders, particularly in more
complex scenes, and exhibits superior unsupervised compositional generation
quality. Project page is available at
$\href{https://latentslotdiffusion.github.io}{here}
Deep Neural Networks for Automatic Speech-To-Speech Translation of Open Educational Resources
[ES] En los últimos años, el aprendizaje profundo ha cambiado significativamente el panorama en diversas áreas del campo de la inteligencia artificial, entre las que se incluyen la visión por computador, el procesamiento del lenguaje natural, robótica o teoría de juegos. En particular, el sorprendente éxito del aprendizaje profundo en múltiples aplicaciones del campo del procesamiento del lenguaje natural tales como el reconocimiento automático del habla (ASR), la traducción automática (MT) o la síntesis de voz (TTS), ha supuesto una mejora drástica en la precisión de estos sistemas, extendiendo así su implantación a un mayor rango de aplicaciones en la vida real. En este momento, es evidente que las tecnologías de reconocimiento automático del habla y traducción automática pueden ser empleadas para producir, de forma efectiva, subtítulos multilingües de alta calidad de contenidos audiovisuales. Esto es particularmente cierto en el contexto de los vídeos educativos, donde las condiciones acústicas son normalmente favorables para los sistemas de ASR y el discurso está gramaticalmente bien formado. Sin embargo, en el caso de TTS, aunque los sistemas basados en redes neuronales han demostrado ser capaces de sintetizar voz de un realismo y calidad sin precedentes, todavía debe comprobarse si esta tecnología está lo suficientemente madura como para mejorar la accesibilidad y la participación en el aprendizaje en línea. Además, existen diversas tareas en el campo de la síntesis de voz que todavía suponen un reto, como la clonación de voz inter-lingüe, la síntesis incremental o la adaptación zero-shot a nuevos locutores. Esta tesis aborda la mejora de las prestaciones de los sistemas actuales de síntesis de voz basados en redes neuronales, así como la extensión de su aplicación en diversos escenarios, en el contexto de mejorar la accesibilidad en el aprendizaje en línea. En este sentido, este trabajo presta especial atención a la adaptación a nuevos locutores y a la clonación de voz inter-lingüe, ya que los textos a sintetizar se corresponden, en este caso, a traducciones de intervenciones originalmente en otro idioma.[CA] Durant aquests darrers anys, l'aprenentatge profund ha canviat significativament el panorama en diverses àrees del camp de la intel·ligència artificial, entre les quals s'inclouen la visió per computador, el processament del llenguatge natural, robòtica o la teoria de jocs. En particular, el sorprenent èxit de l'aprenentatge profund en múltiples aplicacions del camp del processament del llenguatge natural, com ara el reconeixement automàtic de la parla (ASR), la traducció automàtica (MT) o la síntesi de veu (TTS), ha suposat una millora dràstica en la precisió i qualitat d'aquests sistemes, estenent així la seva implantació a un ventall més ampli a la vida real. En aquest moment, és evident que les tecnologies de reconeixement automàtic de la parla i traducció automàtica poden ser emprades per a produir, de forma efectiva, subtítols multilingües d'alta qualitat de continguts audiovisuals. Això és particularment cert en el context dels vídeos educatius, on les condicions acústiques són normalment favorables per als sistemes d'ASR i el discurs està gramaticalment ben format. No obstant això, al cas de TTS, encara que els sistemes basats en xarxes neuronals han demostrat ser capaços de sintetitzar veu d'un realisme i qualitat sense precedents, encara s'ha de comprovar si aquesta tecnologia és ja prou madura com per millorar l'accessibilitat i la participació en l'aprenentatge en línia. A més, hi ha diverses tasques al camp de la síntesi de veu que encara suposen un repte, com ara la clonació de veu inter-lingüe, la síntesi incremental o l'adaptació zero-shot a nous locutors. Aquesta tesi aborda la millora de les prestacions dels sistemes actuals de síntesi de veu basats en xarxes neuronals, així com l'extensió de la seva aplicació en diversos escenaris, en el context de millorar l'accessibilitat en l'aprenentatge en línia. En aquest sentit, aquest treball presta especial atenció a l'adaptació a nous locutors i a la clonació de veu interlingüe, ja que els textos a sintetitzar es corresponen, en aquest cas, a traduccions d'intervencions originalment en un altre idioma.[EN] In recent years, deep learning has fundamentally changed the landscapes of a number of areas in artificial intelligence, including computer vision, natural language processing, robotics, and game theory. In particular, the striking success of deep learning in a large variety of natural language processing (NLP) applications, including automatic speech recognition (ASR), machine translation (MT), and text-to-speech (TTS), has resulted in major accuracy improvements, thus widening the applicability of these technologies in real-life settings. At this point, it is clear that ASR and MT technologies can be utilized to produce cost-effective, high-quality multilingual subtitles of video contents of different kinds. This is particularly true in the case of transcription and translation of video lectures and other kinds of educational materials, in which the audio recording conditions are usually favorable for the ASR task, and there is a grammatically well-formed speech. However, although state-of-the-art neural approaches to TTS have shown to drastically improve the naturalness and quality of synthetic speech over conventional concatenative and parametric systems, it is still unclear whether this technology is already mature enough to improve accessibility and engagement in online learning, and particularly in the context of higher education. Furthermore, advanced topics in TTS such as cross-lingual voice cloning, incremental TTS or zero-shot speaker adaptation remain an open challenge in the field. This thesis is about enhancing the performance and widening the applicability of modern neural TTS technologies in real-life settings, both in offline and streaming conditions, in the context of improving accessibility and engagement in online learning. Thus, particular emphasis is placed on speaker adaptation and cross-lingual voice cloning, as the input text corresponds to a translated utterance in this context.Pérez González De Martos, AM. (2022). Deep Neural Networks for Automatic Speech-To-Speech Translation of Open Educational Resources [Tesis doctoral]. Universitat Politècnica de València. https://doi.org/10.4995/Thesis/10251/184019TESISPremios Extraordinarios de tesis doctorale
Distributional Drift Adaptation with Temporal Conditional Variational Autoencoder for Multivariate Time Series Forecasting
Due to the nonstationary nature, the distribution of real-world multivariate
time series (MTS) changes over time, which is known as distribution drift. Most
existing MTS forecasting models greatly suffer from distribution drift and
degrade the forecasting performance over time. Existing methods address
distribution drift via adapting to the latest arrived data or self-correcting
per the meta knowledge derived from future data. Despite their great success in
MTS forecasting, these methods hardly capture the intrinsic distribution
changes, especially from a distributional perspective. Accordingly, we propose
a novel framework temporal conditional variational autoencoder (TCVAE) to model
the dynamic distributional dependencies over time between historical
observations and future data in MTSs and infer the dependencies as a temporal
conditional distribution to leverage latent variables. Specifically, a novel
temporal Hawkes attention mechanism represents temporal factors subsequently
fed into feed-forward networks to estimate the prior Gaussian distribution of
latent variables. The representation of temporal factors further dynamically
adjusts the structures of Transformer-based encoder and decoder to distribution
changes by leveraging a gated attention mechanism. Moreover, we introduce
conditional continuous normalization flow to transform the prior Gaussian to a
complex and form-free distribution to facilitate flexible inference of the
temporal conditional distribution. Extensive experiments conducted on six
real-world MTS datasets demonstrate the TCVAE's superior robustness and
effectiveness over the state-of-the-art MTS forecasting baselines. We further
illustrate the TCVAE applicability through multifaceted case studies and
visualization in real-world scenarios.Comment: 13 pages, 6 figures, submitted to IEEE Transactions on Neural
Networks and Learning Systems (TNNLS
Low-Complexity Near-Optimum Symbol Detection Based on Neural Enhancement of Factor Graphs
We consider the application of the factor graph framework for symbol
detection on linear inter-symbol interference channels. Based on the Ungerboeck
observation model, a detection algorithm with appealing complexity properties
can be derived. However, since the underlying factor graph contains cycles, the
sum-product algorithm (SPA) yields a suboptimal algorithm. In this paper, we
develop and evaluate efficient strategies to improve the performance of the
factor graph-based symbol detection by means of neural enhancement. In
particular, we consider neural belief propagation and generalizations of the
factor nodes as an effective way to mitigate the effect of cycles within the
factor graph. By applying a generic preprocessor to the channel output, we
propose a simple technique to vary the underlying factor graph in every SPA
iteration. Using this dynamic factor graph transition, we intend to preserve
the extrinsic nature of the SPA messages which is otherwise impaired due to
cycles. Simulation results show that the proposed methods can massively improve
the detection performance, even approaching the maximum a posteriori
performance for various transmission scenarios, while preserving a complexity
which is linear in both the block length and the channel memory.Comment: revised version. arXiv admin note: text overlap with arXiv:2203.0333
Hardware implementation of a pipelined turbo decoder
Turbo codes have been widely studied since they were first proposed in 1993 by Berrou, Glavieux, and Thitimajshima in "Near Shannon Limit error-correcting coding and decoding: Turbo-codes" [1]. They have the advantage of providing a low bit error rate (BER) in decoding, and outperform linear block and convolutional codes in low signal-to-noise-ratio (SNR) environments. The decoding performance of turbo codes can be very close to the Shannon Limit, about 0.7decibel (dB). It is determined by the architectures of the constituent encoders and interleaver, but is bounded in high SNRs by an error floor. Turbo codes are widely used in communications. We explore the codeword weight spectrum properties that contribute to their excellent performance. Furthermore, the decoding performance is analyzed and compared with the free distance asymptotic performance. A 16-state turbo decoder is implemented using VHSIC Hardware Description Language (VHDL) and then mapped onto a field-programmable gate array (FPGA) board. The hardware implementations are compared with the software simulations to verify the decoding correctness. A pipelined architecture is then implemented which significantly reduces the decoding latency. -- Keywords: turbo codes; decoding performance; Monte Carlo simulations; FPGA implementatio
The Telecommunications and Data Acquisition Report
This quarterly publication provides archival reports on developments in programs managed by JPL's Telecommunications and Mission Operations Directorate (TMOD), which now includes the former Telecommunications and Data Acquisition (TDA) Office. In space communications, radio navigation, radio science, and ground-based radio and radar astronomy, it reports on activities of the Deep Space Network (DSN) in planning, supporting research and technology, implementation, and operations. Also included are standards activity at JPL for space data and information systems and reimbursable DSN work performed for other space agencies through NASA. The preceding work is all performed for NASA's Office of Space Communications (OSC). TMOD also performs work funded by other NASA program offices through and with the cooperation of OSC. The first of these is the Orbital Debris Radar Program funded by the Office of Space Systems Development. It exists at Goldstone only and makes use of the planetary radar capability when the antennas are configured as science instruments making direct observations of the planets, their satellites, and asteroids of our solar system. The Office of Space Sciences funds the data reduction and science analyses of data obtained by the Goldstone Solar System Radar. The antennas at all three complexes are also configured for radio astronomy research and, as such, conduct experiments funded by the National Science Foundation in the U.S. and other agencies at the overseas complexes. These experiments are either in microwave spectroscopy or very long baseline interferometry. Finally, tasks funded under the JPL Director's Discretionary Fund and the Caltech President's Fund that involve TMOD are included. This and each succeeding issue of 'The Telecommunications and Data Acquisition Progress Report' will present material in some, but not necessarily all, of the aforementioned programs
- …