82 research outputs found
A NOVEL ARCHITECTURE FOR PARKING MANAGEMENT IN SMART CITIES
Parking is becoming an expensive resource in almost every major city in the world. Current technically advanced solutions for parking management apply secured wireless networks and sensor communication to parking reservation. Moreover, new rules concerning financial transactions in mobile payment allow the definition of new intelligent frameworks that enable convenient management of public parking in urban areas. The paper discusses the conceptual architecture of IPA (Intelligent Parking Assistant), which aims to improve on current parking management solutions and thereby become a leading paradigm for so-called "smart cities".
A Novel Architecture of Parking Management for Smart Cities
Abstract: Parking is becoming an expensive resource in almost every major city in the world. Current technically advanced solutions for parking management apply secured wireless networks and sensor communication to parking reservation. Moreover, new rules concerning financial transactions in mobile payment allow the definition of new intelligent frameworks that enable convenient management of public parking in urban areas. The paper discusses the conceptual architecture of IPA (Intelligent Parking Assistant), which aims to improve on current parking management solutions and thereby become a leading paradigm for so-called “smart cities”.
Boosting End-to-End Multilingual Phoneme Recognition through Exploiting Universal Speech Attributes Constraints
We propose a first step toward multilingual end-to-end automatic speech
recognition (ASR) by integrating knowledge about speech articulators. The key
idea is to leverage a rich set of fundamental units that can be defined
"universally" across all spoken languages, referred to as speech attributes,
namely manner and place of articulation. Specifically, several deterministic
attribute-to-phoneme mapping matrices are constructed based on the predefined
universal attribute inventory; these matrices project the knowledge-rich
articulatory attribute logits into output phoneme logits. The mapping puts
knowledge-based constraints to limit inconsistency with acoustic-phonetic
evidence in the integrated prediction. Combined with phoneme recognition, our
phone recognizer is able to infer from both attribute and phoneme information.
The proposed joint multilingual model is evaluated through phoneme recognition.
In multilingual experiments over 6 languages on benchmark datasets LibriSpeech
and CommonVoice, we find that our proposed solution outperforms conventional
multilingual approaches with a relative improvement of 6.85% on average, and it
also demonstrates much better performance compared to a monolingual model.
Further analysis conclusively demonstrates that the proposed solution
eliminates phoneme predictions that are inconsistent with attributes.
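The projection of attribute logits into phoneme logits described above can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the toy manner and phoneme inventories, the 0/1 mapping matrix, and the additive fusion rule are hypothetical and are not taken from the paper.

```python
# Toy inventory; the labels and memberships below are illustrative assumptions.
MANNERS = ["stop", "fricative", "nasal", "vowel"]
PHONEMES = ["p", "b", "s", "m", "a"]

# MAPPING[i][j] is 1 when phoneme j has manner i, else 0.
# The matrix is fixed by phonetic knowledge (deterministic), not learned.
MAPPING = [
    [1, 1, 0, 0, 0],  # stop: p, b
    [0, 0, 1, 0, 0],  # fricative: s
    [0, 0, 0, 1, 0],  # nasal: m
    [0, 0, 0, 0, 1],  # vowel: a
]


def project(attr_logits, mapping):
    """Project attribute logits into phoneme space (vector-matrix product)."""
    n_phonemes = len(mapping[0])
    return [
        sum(a * row[j] for a, row in zip(attr_logits, mapping))
        for j in range(n_phonemes)
    ]


def combine(phoneme_logits, attr_logits, mapping):
    """Assumed fusion rule: add the projected attribute logits to the phoneme logits."""
    projected = project(attr_logits, mapping)
    return [p + q for p, q in zip(phoneme_logits, projected)]


def best(logits):
    """Return the phoneme label with the highest logit."""
    return PHONEMES[max(range(len(logits)), key=logits.__getitem__)]


# The phoneme head alone slightly prefers the fricative "s" ...
phoneme_logits = [1.0, 0.9, 1.2, 0.1, 0.0]
# ... but the manner head is confident that the sound is a stop.
manner_logits = [2.0, -1.0, 0.0, 0.0]

fused = combine(phoneme_logits, manner_logits, MAPPING)
print(best(phoneme_logits), "->", best(fused))
```

In this toy case the attribute evidence overrides a phoneme hypothesis that disagrees with it, which is the kind of knowledge-based constraint the abstract refers to.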
S-HR-VQVAE: Sequential Hierarchical Residual Learning Vector Quantized Variational Autoencoder for Video Prediction
We address the video prediction task by putting forth a novel model that
combines (i) our recently proposed hierarchical residual vector quantized
variational autoencoder (HR-VQVAE), and (ii) a novel spatiotemporal PixelCNN
(ST-PixelCNN). We refer to this approach as a sequential hierarchical residual
learning vector quantized variational autoencoder (S-HR-VQVAE). By leveraging
the intrinsic capability of HR-VQVAE to model still images with a
parsimonious representation, combined with the ST-PixelCNN's ability to
handle spatiotemporal information, S-HR-VQVAE can better deal with chief
challenges in video prediction. These include learning spatiotemporal
information, handling high dimensional data, combating blurry prediction, and
implicit modeling of physical characteristics. Extensive experimental results
on the KTH Human Action and Moving-MNIST tasks demonstrate that our model
compares favorably against top video prediction techniques both in quantitative
and qualitative evaluations despite a much smaller model size. Finally, we
boost S-HR-VQVAE by proposing a novel training method to jointly estimate the
HR-VQVAE and ST-PixelCNN parameters.
Comment: 14 pages, 7 figures, 3 tables. Submitted to IEEE Transactions on
Pattern Analysis and Machine Intelligence on 2023-07-1
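As a rough illustration of the residual quantization idea behind the name HR-VQVAE, the sketch below quantizes a vector in stages, with each stage encoding what the previous stages left over. It is a simplified, generic residual quantizer under stated assumptions: the codebooks are hand-picked rather than learned, and the hierarchical linking of codebooks, the autoencoder, and the ST-PixelCNN are all omitted.

```python
def nearest(vec, codebook):
    """Index of the codeword closest to vec in squared Euclidean distance."""
    return min(
        range(len(codebook)),
        key=lambda k: sum((v - c) ** 2 for v, c in zip(vec, codebook[k])),
    )


def residual_quantize(vec, codebooks):
    """Quantize vec layer by layer; each layer encodes the remaining residual."""
    residual = list(vec)
    recon = [0.0] * len(vec)
    indices = []
    for codebook in codebooks:
        k = nearest(residual, codebook)
        indices.append(k)
        recon = [r + c for r, c in zip(recon, codebook[k])]
        residual = [r - c for r, c in zip(residual, codebook[k])]
    return indices, recon


# Hypothetical hand-picked codebooks: a coarse layer and a finer layer.
COARSE = [[0, 0], [4, 4], [-4, -4]]
FINE = [[0, 0], [1, 0], [0, 1], [-1, 0], [0, -1]]

indices, recon = residual_quantize([4.8, 4.1], [COARSE, FINE])
print(indices, recon)  # -> [1, 1] [5.0, 4.0]
```

The coarse layer alone would reconstruct the input as [4, 4]; adding the fine layer moves the reconstruction to [5, 4], which is closer to [4.8, 4.1] while storing only two small indices.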
An Automatic Speech Processing Strategy Based on Attribute Detection (Una estrategia de procesamiento automático del habla basada en la detección de atributos)
State-of-the-art automatic speech and speaker recognition systems are often built with a pattern matching framework that has proven to achieve low recognition error rates for a variety of resource-rich tasks when the volume of speech and text examples to build statistical acoustic and language models is plentiful, and the speaker, acoustics and language conditions follow a rigid protocol. However, because of the “black-box” top-down knowledge integration approach, such systems cannot easily leverage a rich set of knowledge sources already available in the literature on speech, acoustics and languages. In this paper, we present a bottom-up approach to knowledge integration, called automatic speech attribute transcription (ASAT), which is intended to be “knowledge-rich”, so that new and existing knowledge sources can be verified and integrated into current spoken language systems to improve recognition accuracy and system robustness. Since the ASAT framework offers a “divide-and-conquer” strategy and a “plug-and-play” game plan, it will facilitate a cooperative speech processing community that every researcher can contribute to, with a view to improving speech processing capabilities which are currently not easily accessible to researchers in the speech science community.
Maximum a Posteriori Adaptation of Network Parameters in Deep Models
We present a Bayesian approach to adapting parameters of a well-trained
context-dependent, deep-neural-network, hidden Markov model (CD-DNN-HMM) to
improve automatic speech recognition performance. Given an abundance of DNN
parameters but with only a limited amount of data, the effectiveness of the
adapted DNN model can often be compromised. We formulate maximum a posteriori
(MAP) adaptation of parameters of a specially designed CD-DNN-HMM with
augmented linear hidden networks connected to the output tied states, or
senones, and compare it to the previously proposed feature space MAP linear
regression. Experimental evidence on the 20,000-word open vocabulary Wall Street
Journal task demonstrates the feasibility of the proposed framework. In
supervised adaptation, the proposed MAP adaptation approach provides more than
10% relative error reduction and consistently outperforms the conventional
transformation based methods. Furthermore, we present an initial attempt to
generate hierarchical priors to improve adaptation efficiency and effectiveness
with limited adaptation data by exploiting similarities among senones.
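The MAP principle that the abstract relies on can be seen in its simplest textbook form: estimating a Gaussian mean under a Gaussian prior. The sketch below is only that toy case, not the paper's DNN formulation; the function name and the `prior_weight` parameter are hypothetical. With known variance sigma^2 and prior N(prior_mean, sigma^2 / prior_weight), the MAP estimate is a weighted average of the prior mean and the adaptation data.

```python
def map_mean(prior_mean, prior_weight, samples):
    """MAP estimate of a Gaussian mean under a conjugate Gaussian prior.

    prior_weight acts like a count of pseudo-observations located at
    prior_mean: the larger it is, the more data is needed to move away
    from the prior.
    """
    n = len(samples)
    return (prior_weight * prior_mean + sum(samples)) / (prior_weight + n)


# Prior (for example, a speaker-independent value) is 0.0; the new data suggest 2.0.
few = map_mean(0.0, 4.0, [2.0] * 2)    # little data: stays near the prior
many = map_mean(0.0, 4.0, [2.0] * 16)  # more data: moves toward the sample mean
print(few, many)
```

With only two samples the estimate remains close to the prior, and with sixteen it approaches the sample mean. This behaviour is why MAP adaptation suits the limited-data setting described above.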
Relational Teacher Student Learning with Neural Label Embedding for Device Adaptation in Acoustic Scene Classification
In this paper, we propose a domain adaptation framework to address the device
mismatch issue in acoustic scene classification by leveraging neural label
embedding (NLE) and relational teacher student learning (RTSL). Taking into
account the structural relationships between acoustic scene classes, our
proposed framework captures such relationships, which are intrinsically
device-independent. In the training stage, transferable knowledge is condensed
in NLE from the source domain. Next in the adaptation stage, a novel RTSL
strategy is adopted to learn adapted target models without using paired
source-target data often required in conventional teacher student learning. The
proposed framework is evaluated on the DCASE 2018 Task1b data set. Experimental
results based on AlexNet-L deep classification models confirm the effectiveness
of our proposed approach for mismatch situations. NLE-alone adaptation compares
favourably with the conventional device adaptation and teacher student based
adaptation techniques. NLE with RTSL further improves the classification
accuracy.
Comment: Accepted by Interspeech 202
On Mean Absolute Error for Deep Neural Network Based Vector-to-Vector Regression
In this paper, we exploit the properties of mean absolute error (MAE) as a loss function for deep neural network (DNN) based vector-to-vector regression. The goal of this work is two-fold: (i) presenting performance bounds of MAE, and (ii) demonstrating new properties of MAE that make it more appropriate than mean squared error (MSE) as a loss function for DNN based vector-to-vector regression. First, we show that a generalized upper bound for DNN-based vector-to-vector regression can be ensured by leveraging the known Lipschitz continuity property of MAE. Next, we derive a new generalized upper bound in the presence of additive noise. Finally, in contrast to conventional MSE, commonly adopted to approximate Gaussian errors for regression, we show that MAE can be interpreted as an error modeled by a Laplacian distribution. Speech enhancement experiments are conducted to corroborate our proposed theorems and validate the performance advantages of MAE over MSE for DNN based regression.
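The link between MAE and a Laplacian error model can be made concrete with the standard negative log-likelihood argument, sketched here for scalar targets. The notation is assumed for illustration and is not taken from the paper: $y_i$ are targets, $\hat{y}_i$ are predictions, $b$ is the Laplacian scale, and $\sigma^2$ is the Gaussian variance.

```latex
% Laplacian errors: p(e) = \frac{1}{2b} \exp(-|e| / b)
-\log \prod_{i=1}^{N} p(y_i - \hat{y}_i)
  = N \log(2b) + \frac{1}{b} \sum_{i=1}^{N} \lvert y_i - \hat{y}_i \rvert

% Gaussian errors: p(e) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp(-e^2 / (2\sigma^2))
-\log \prod_{i=1}^{N} p(y_i - \hat{y}_i)
  = \frac{N}{2} \log(2\pi\sigma^2) + \frac{1}{2\sigma^2} \sum_{i=1}^{N} (y_i - \hat{y}_i)^2
```

For fixed $b$ and $\sigma^2$, the first terms do not depend on the predictions, so maximizing the Laplacian likelihood is equivalent to minimizing MAE, just as maximizing the Gaussian likelihood is equivalent to minimizing MSE.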