Named Entity Recognition in Twitter using Images and Text
Named Entity Recognition (NER) is an important subtask of information
extraction that seeks to locate and recognise named entities. Despite recent
achievements, we still face limitations in correctly detecting and
classifying entities, particularly in short and noisy text such as Twitter. An
important drawback of most NER approaches is their heavy dependence on
hand-crafted features and domain-specific knowledge, necessary to achieve
state-of-the-art results. Thus, devising models to deal with such
linguistically complex contexts is still challenging. In this paper, we propose
a novel multi-level architecture that does not rely on any specific linguistic
resource or encoded rule. Unlike traditional approaches, we use features
extracted from images and text to classify named entities. Experiments
against state-of-the-art Twitter NER systems on the Ritter dataset show
competitive results (0.59 F-measure), indicating that this approach may lead
towards better NER models.
Comment: The 3rd International Workshop on Natural Language Processing for
Informal Text (NLPIT 2017), 8 pages
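As an illustration of the idea above (not the authors' implementation), here is a minimal PyTorch sketch of fusing a tweet's token embeddings with a single image feature vector for token-level entity tagging; all module names, dimensions, and the tag count are assumptions.

```python
# Hedged sketch: late fusion of text and image features for token-level NER.
# The BiLSTM encoder, feature sizes, and tag set size are assumptions.
import torch
import torch.nn as nn

class MultimodalNER(nn.Module):
    def __init__(self, text_dim=300, image_dim=512, hidden=128, num_tags=9):
        super().__init__()
        self.text_rnn = nn.LSTM(text_dim, hidden, batch_first=True,
                                bidirectional=True)
        # one global image vector per tweet, projected and broadcast to tokens
        self.img_proj = nn.Linear(image_dim, 2 * hidden)
        self.classifier = nn.Linear(4 * hidden, num_tags)

    def forward(self, token_embs, image_feat):
        # token_embs: (batch, seq_len, text_dim); image_feat: (batch, image_dim)
        text_out, _ = self.text_rnn(token_embs)
        img = self.img_proj(image_feat).unsqueeze(1).expand_as(text_out)
        fused = torch.cat([text_out, img], dim=-1)
        return self.classifier(fused)  # per-token tag logits

logits = MultimodalNER()(torch.randn(2, 12, 300), torch.randn(2, 512))
print(logits.shape)  # torch.Size([2, 12, 9])
```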
A Survey on Deep Multi-modal Learning for Body Language Recognition and Generation
Body language (BL) refers to the non-verbal communication expressed through
physical movements, gestures, facial expressions, and postures. It is a form of
communication that conveys information, emotions, attitudes, and intentions
without the use of spoken or written words. It plays a crucial role in
interpersonal interactions and can complement or even override verbal
communication. Deep multi-modal learning techniques have shown promise in
understanding and analyzing these diverse aspects of BL. The survey emphasizes
their applications to BL generation and recognition. Several common BLs are
considered, i.e., Sign Language (SL), Cued Speech (CS), Co-speech (CoS), and
Talking Head (TH), and we analyse and establish the connections among these
four BLs for the first time. Their generation and recognition often involve
multi-modal approaches. Benchmark datasets for BL research are collected and
organized, along with an evaluation of SOTA
methods on these datasets. The survey highlights challenges such as limited
labeled data, multi-modal learning, and the need for domain adaptation to
generalize models to unseen speakers or languages. Future research directions
are presented, including exploring self-supervised learning techniques,
integrating contextual information from other modalities, and exploiting
large-scale pre-trained multi-modal models. In summary, this survey paper
provides, for the first time, a comprehensive understanding of deep multi-modal
learning for the generation and recognition of these various BLs. By analyzing advancements,
challenges, and future directions, it serves as a valuable resource for
researchers and practitioners in advancing this field. In addition, we maintain
a continuously updated paper list for deep multi-modal learning for BL
recognition and generation: https://github.com/wentaoL86/awesome-body-language
Knowledge-enhanced Visual-Language Pre-training on Chest Radiology Images
While multi-modal foundation models pre-trained on large-scale data have been
successful in natural language understanding and vision recognition, their use
in medical domains is still limited due to the fine-grained nature of medical
tasks and the high demand for domain knowledge. To address this challenge, we
propose a novel approach called Knowledge-enhanced Auto Diagnosis (KAD) which
leverages existing medical domain knowledge to guide vision-language
pre-training using paired chest X-rays and radiology reports. We evaluate KAD
on four external X-ray datasets and demonstrate that its zero-shot
performance is not only comparable to that of fully-supervised models, but also
superior to the average of three expert radiologists for three (out of five)
pathologies with statistical significance. Moreover, when few-shot annotation
is available, KAD outperforms all existing approaches in fine-tuning settings,
demonstrating its potential for application in different clinical scenarios.
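The general mechanism behind this kind of vision-language pre-training is contrastive alignment of paired images and reports; the sketch below shows only that generic step, not KAD's knowledge-enhancement, and the temperature and embedding sizes are assumptions.

```python
# Hedged sketch: symmetric InfoNCE loss over paired X-ray / report embeddings.
import torch
import torch.nn.functional as F

def contrastive_loss(img_emb, txt_emb, temperature=0.07):
    img = F.normalize(img_emb, dim=-1)
    txt = F.normalize(txt_emb, dim=-1)
    logits = img @ txt.t() / temperature       # (batch, batch) similarities
    targets = torch.arange(img.size(0))        # i-th image pairs with i-th report
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

loss = contrastive_loss(torch.randn(8, 256), torch.randn(8, 256))
```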
Deep Learning for Environmentally Robust Speech Recognition: An Overview of Recent Developments
Eliminating the negative effect of non-stationary environmental noise is a
long-standing research topic for automatic speech recognition that still
remains an important challenge. Data-driven supervised approaches, including
ones based on deep neural networks, have recently emerged as potential
alternatives to traditional unsupervised approaches and, with sufficient
training, can alleviate the shortcomings of the unsupervised methods in various
real-life acoustic environments. In this light, we review recently developed,
representative deep learning approaches for tackling non-stationary additive
and convolutional degradation of speech with the aim of providing guidelines
for those involved in the development of environmentally robust speech
recognition systems. We separately discuss single- and multi-channel techniques
developed for the front-end and back-end of speech recognition systems, as well
as joint front-end and back-end training frameworks.
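As a concrete instance of a single-channel front-end technique of the kind surveyed, here is a minimal sketch of time-frequency mask estimation for noise suppression; the architecture and feature sizes are assumptions, not taken from the overview.

```python
# Hedged sketch: an LSTM predicts a [0, 1] mask over the noisy magnitude
# spectrogram; the masked spectrogram is then passed to the ASR back end.
import torch
import torch.nn as nn

class MaskEstimator(nn.Module):
    def __init__(self, n_freq=257, hidden=256):
        super().__init__()
        self.rnn = nn.LSTM(n_freq, hidden, num_layers=2, batch_first=True)
        self.out = nn.Linear(hidden, n_freq)

    def forward(self, noisy_mag):
        # noisy_mag: (batch, frames, n_freq) magnitude spectrogram
        h, _ = self.rnn(noisy_mag)
        mask = torch.sigmoid(self.out(h))
        return mask * noisy_mag  # enhanced magnitudes

enhanced = MaskEstimator()(torch.rand(4, 100, 257))
```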
Transfer Learning for Speech and Language Processing
Transfer learning is a vital technique that generalizes models trained for
one setting or task to other settings or tasks. For example, in speech
recognition, an acoustic model trained for one language can be used to
recognize speech in another language, with little or no re-training data.
Transfer learning is closely related to multi-task learning (cross-lingual vs.
multilingual), and is traditionally studied under the name of 'model adaptation'.
Recent advances in deep learning show that transfer learning becomes much
easier and more effective with high-level abstract features learned by deep
models, and the `transfer' can be conducted not only between data distributions
and data types, but also between model structures (e.g., shallow nets and deep
nets) or even model types (e.g., Bayesian models and neural models). This
review paper summarizes some recent prominent research towards this direction,
particularly for speech and language processing. We also report some results
from our group and highlight the potential of this very interesting research
field.
Comment: 13 pages, APSIPA 2015
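To make the cross-lingual example concrete, here is a hedged sketch of transferring the hidden layers of a source-language acoustic model to a target language with a different phone set; layer sizes and the freezing choice are assumptions, not the paper's recipe.

```python
# Hedged sketch: reuse shared hidden layers, re-initialize only the output layer.
import torch.nn as nn

def build_acoustic_model(n_phones, feat_dim=40, hidden=512):
    return nn.Sequential(
        nn.Linear(feat_dim, hidden), nn.ReLU(),
        nn.Linear(hidden, hidden), nn.ReLU(),
        nn.Linear(hidden, n_phones),
    )

source_model = build_acoustic_model(n_phones=120)  # assumed pre-trained on the source language
target_model = build_acoustic_model(n_phones=90)   # target-language phone set

# transfer the shared hidden layers; the new output layer stays randomly initialized
target_model[0].load_state_dict(source_model[0].state_dict())
target_model[2].load_state_dict(source_model[2].state_dict())

# optionally freeze the transferred layers when target-language data is scarce
for layer in (target_model[0], target_model[2]):
    for p in layer.parameters():
        p.requires_grad = False
```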
ViLP: Knowledge Exploration using Vision, Language, and Pose Embeddings for Video Action Recognition
Video Action Recognition (VAR) is a challenging task due to its inherent
complexities. Though different approaches have been explored in the literature,
designing a unified framework to recognize a large number of human actions is
still a challenging problem. Recently, Multi-Modal Learning (MML) has
demonstrated promising results in this domain. In the literature, the 2D
skeleton or pose modality has often been used for this task, either independently or in
conjunction with the visual information (RGB modality) present in videos.
However, the combination of pose, visual information, and text attributes has
not been explored yet, though text and pose attributes independently have been
proven to be effective in numerous computer vision tasks. In this paper, we
present the first pose-augmented vision-language model (VLM) for VAR. Notably,
our scheme achieves an accuracy of 92.81% and 73.02% on two popular human video
action recognition benchmark datasets, UCF-101 and HMDB-51, respectively, even
without any video data pre-training, and an accuracy of 96.11% and 75.75% after
Kinetics pre-training.
Comment: 7 pages, 3 figures, 2 tables
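Below is a hedged sketch of the generic fusion step such a pose-augmented VLM requires: combining pre-computed vision, text, and pose embeddings into action logits. The encoders, dimensions, late-fusion design, and class count (101, matching UCF-101) are assumptions, not the paper's architecture.

```python
# Hedged sketch: late fusion of vision, text, and pose embeddings for action classification.
import torch
import torch.nn as nn

class TriModalFusion(nn.Module):
    def __init__(self, vis_dim=512, txt_dim=512, pose_dim=256, n_actions=101):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Linear(vis_dim + txt_dim + pose_dim, 512),
            nn.ReLU(),
            nn.Linear(512, n_actions),
        )

    def forward(self, vis_emb, txt_emb, pose_emb):
        # each input: (batch, dim) embedding from a pre-trained encoder
        return self.fuse(torch.cat([vis_emb, txt_emb, pose_emb], dim=-1))

logits = TriModalFusion()(torch.randn(2, 512), torch.randn(2, 512), torch.randn(2, 256))
```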
Automatic transcription of multi-genre media archives
This paper describes some recent results of our collaborative work on
developing a speech recognition system for the automatic transcription
of media archives from the British Broadcasting Corporation (BBC). The
material includes a wide diversity of shows with their associated
metadata. The latter are highly diverse in terms of completeness,
reliability and accuracy. First, we investigate how to improve lightly
supervised acoustic training, when timestamp information is inaccurate
and when speech deviates significantly from the transcription, and how
to perform evaluations when no reference transcripts are available.
An automatic timestamp correction method, as well as word- and
segment-level combination approaches between the lightly supervised
transcripts and the original programme scripts, are presented, which yield improved
metadata. Experimental results show that systems trained using the
improved metadata consistently outperform those trained with only the
original lightly supervised decoding hypotheses. Secondly, we show that
the recognition task may benefit from systems trained on a combination
of in-domain and out-of-domain data. Working with tandem HMMs, we
describe Multi-level Adaptive Networks, a novel technique for
incorporating information from out-of-domain posterior features using a
deep neural network. We show that it provides a substantial reduction in
WER over other systems including a PLP-based baseline, in-domain tandem
features, and the best out-of-domain tandem features.This research was supported by EPSRC Programme Grant EP/I031022/1 (Natural Speech Technology).This paper was presented at the First Workshop on Speech, Language and Audio in Multimedia, August 22-23, 2013; Marseille. It was published in CEUR Workshop Proceedings at http://ceur-ws.org/Vol-1012/