Multilingual Universal Sentence Encoder for Semantic Retrieval
We introduce two pre-trained retrieval focused multilingual sentence encoding
models, respectively based on the Transformer and CNN model architectures. The
models embed text from 16 languages into a single semantic space using a
multi-task trained dual-encoder that learns tied representations using
translation based bridge tasks (Chidambaram et al., 2018). The models provide
performance that is competitive with the state-of-the-art on: semantic
retrieval (SR), translation pair bitext retrieval (BR) and retrieval question
answering (ReQA). On English transfer learning tasks, our sentence-level
embeddings approach, and in some cases exceed, the performance of monolingual,
English only, sentence embedding models. Our models are made available for
download on TensorFlow Hub.
Comment: 6 pages, 6 tables, 2 listings, and 1 figure
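As a rough illustration of how semantic retrieval over a single shared embedding space can work, the sketch below ranks candidate sentences by cosine similarity to a query embedding. The vectors are made-up toy values standing in for encoder outputs, and `cosine` and `rank_candidates` are hypothetical helper names; the released models themselves are not loaded or used here.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def rank_candidates(query_vec, candidates):
    """Return candidate keys sorted from most to least similar to the query."""
    scored = [(cosine(query_vec, vec), key) for key, vec in candidates.items()]
    return [key for _, key in sorted(scored, reverse=True)]

# Toy 3-d embeddings (assumed values, not real encoder outputs).
query = [0.9, 0.1, 0.0]
candidates = {
    "en: How old are you?": [0.8, 0.2, 0.1],
    "de: Wie alt bist du?": [0.85, 0.15, 0.05],
    "fr: Il pleut.": [0.0, 0.2, 0.9],
}
ranking = rank_candidates(query, candidates)
```

Because sentences from different languages live in the same space, a single similarity function is enough to retrieve across languages; in this toy data the unrelated French sentence ranks last.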
How to Adapt Pre-trained Vision-and-Language Models to a Text-only Input?
Current language models have been criticised for learning language from text
alone without connection between words and their meaning. Consequently,
multimodal training has been proposed as a way for creating models with better
language understanding by providing the lacking connection. We focus on
pre-trained multimodal vision-and-language (VL) models for which there already
are some results on their language understanding capabilities. An unresolved
issue with evaluating the linguistic skills of these models, however, is that
there is no established method for adapting them to text-only input without
out-of-distribution uncertainty. To find the best approach, we investigate and
compare seven possible methods for adapting three different pre-trained VL
models to text-only input. Our evaluations on both GLUE and Visual Property
Norms (VPN) show that care should be put into adapting VL models to zero-shot
text-only tasks, while the models are less sensitive to how we adapt them to
non-zero-shot tasks. We also find that the adaptation methods perform
differently for different models and that unimodal model counterparts perform
on par with the VL models regardless of adaptation, indicating that current VL
models do not necessarily gain better language understanding from their
multimodal training.
A Comprehensive Survey on Applications of Transformers for Deep Learning Tasks
Transformer is a deep neural network that employs a self-attention mechanism
to comprehend the contextual relationships within sequential data. Unlike
conventional neural networks or updated versions of Recurrent Neural Networks
(RNNs) such as Long Short-Term Memory (LSTM), transformer models excel in
handling long dependencies between input sequence elements and enable parallel
processing. As a result, transformer-based models have attracted substantial
interest among researchers in the field of artificial intelligence. This can be
attributed to their immense potential and remarkable achievements, not only in
Natural Language Processing (NLP) tasks but also in a wide range of domains,
including computer vision, audio and speech processing, healthcare, and the
Internet of Things (IoT). Although several survey papers have been published
highlighting the transformer's contributions in specific fields, architectural
differences, or performance evaluations, there is still a significant absence
of a comprehensive survey paper encompassing its major applications across
various domains. Therefore, we undertook the task of filling this gap by
conducting an extensive survey of proposed transformer models from 2017 to
2022. Our survey encompasses the identification of the top five application
domains for transformer-based models, namely: NLP, Computer Vision,
Multi-Modality, Audio and Speech Processing, and Signal Processing. We analyze
the impact of highly influential transformer-based models in these domains and
subsequently classify them based on their respective tasks using a proposed
taxonomy. Our aim is to shed light on the existing potential and future
possibilities of transformers for enthusiastic researchers, thus contributing
to the broader understanding of this groundbreaking technology.
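The self-attention mechanism mentioned in this abstract can be sketched in a few lines. This is a minimal, dependency-free version of scaled dot-product self-attention that assumes identity projections (queries, keys, and values are all the raw input vectors); real transformer layers use learned projection matrices and multiple heads. The function names and toy inputs are illustrative only.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(x):
    """Scaled dot-product self-attention with Q = K = V = x (an assumption)."""
    d = len(x[0])
    outputs, weights = [], []
    for q in x:
        # Similarity of this position to every position, scaled by sqrt(d).
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in x]
        w = softmax(scores)
        weights.append(w)
        # Output is the attention-weighted average of all value vectors.
        outputs.append([sum(w_j * v[i] for w_j, v in zip(w, x)) for i in range(d)])
    return outputs, weights

# Three toy 2-d token vectors (assumed values).
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, attn = self_attention(tokens)
```

Every position is computed independently of the others inside the loop, which is what allows the parallel processing over sequence elements that the abstract contrasts with recurrent networks.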
Ultrafast Video Attention Prediction with Coupled Knowledge Distillation
Large convolutional neural network models have recently demonstrated
impressive performance on video attention prediction. Conventionally, these
models require intensive computation and large memory. To address these
issues, we design an extremely light-weight network with ultrafast speed, named
UVA-Net. The network is constructed based on depth-wise convolutions and takes
low-resolution images as input. However, this straightforward acceleration
method decreases performance dramatically. To counter this, we propose a
coupled knowledge distillation strategy to augment and train the network
effectively. With this strategy, the model can further automatically discover
and emphasize implicit useful cues contained in the data. Both spatial and
temporal knowledge learned by the high-resolution complex teacher networks can
also be distilled and transferred into the proposed low-resolution light-weight
spatiotemporal network. Experimental results show that the performance of our
model is comparable to that of ten state-of-the-art models in video attention
prediction, while it has a memory footprint of only 0.68 MB and runs at about
10,106 FPS on GPU and 404 FPS on CPU, which is 206 times faster than previous models.
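To give a feel for why depth-wise convolutions shrink a network, the sketch below compares weight counts for a standard convolution and a depth-wise convolution followed by a 1x1 point-wise convolution. The pairing with a point-wise layer and the layer sizes are assumptions chosen for illustration; they are not taken from UVA-Net.

```python
def standard_conv_params(k, c_in, c_out):
    """Weights in a standard k x k convolution (bias ignored)."""
    return k * k * c_in * c_out

def depthwise_separable_params(k, c_in, c_out):
    """Weights in a k x k depth-wise conv plus a 1x1 point-wise conv (bias ignored)."""
    depthwise = k * k * c_in   # one k x k filter per input channel
    pointwise = c_in * c_out   # 1x1 conv that mixes channels
    return depthwise + pointwise

# Hypothetical layer: 3x3 kernel, 64 input channels, 128 output channels.
k, c_in, c_out = 3, 64, 128
std = standard_conv_params(k, c_in, c_out)        # 9 * 64 * 128 = 73728
sep = depthwise_separable_params(k, c_in, c_out)  # 576 + 8192 = 8768
ratio = std / sep
```

For this hypothetical layer the separable form needs roughly one eighth of the weights, which is the kind of saving that makes a sub-megabyte model plausible.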
Online Continual Learning on Sequences
Online continual learning (OCL) refers to the ability of a system to learn
over time from a continuous stream of data without having to revisit previously
encountered training samples. Learning continually in a single data pass is
crucial for agents and robots operating in changing environments and required
to acquire, fine-tune, and transfer increasingly complex representations from
non-i.i.d. input distributions. Machine learning models that address OCL must
alleviate catastrophic forgetting, in which hidden representations are
disrupted or completely overwritten when learning from streams of novel input.
In this chapter, we summarize and discuss recent deep learning models that
address OCL on sequential input through the use (and combination) of synaptic
regularization, structural plasticity, and experience replay. Different
implementations of replay have been proposed that alleviate catastrophic
forgetting in connectionist architectures via the re-occurrence of (latent
representations of) input sequences and that functionally resemble mechanisms
of hippocampal replay in the mammalian brain. Empirical evidence shows that
architectures endowed with experience replay typically outperform architectures
without it in (online) incremental learning tasks.
Comment: L. Oneto et al. (eds.), Recent Trends in Learning From Data, Studies
in Computational Intelligence 89
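Experience replay, as described in this abstract, can be sketched with a small fixed-size memory that is filled from the data stream in a single pass. The version below uses reservoir sampling to decide what to keep, which is one common choice and an assumption here, not a method attributed to the chapter. `ReservoirReplayBuffer` and the integer "samples" are hypothetical stand-ins for real training data.

```python
import random

class ReservoirReplayBuffer:
    """Fixed-size memory; each streamed item is retained with equal probability."""

    def __init__(self, capacity, seed=0):
        self.capacity = capacity
        self.items = []
        self.seen = 0                      # number of stream items observed so far
        self.rng = random.Random(seed)

    def add(self, sample):
        self.seen += 1
        if len(self.items) < self.capacity:
            self.items.append(sample)      # buffer not yet full: always keep
        else:
            j = self.rng.randrange(self.seen)
            if j < self.capacity:          # keep with probability capacity / seen
                self.items[j] = sample

    def sample(self, batch_size):
        """Draw stored items to rehearse alongside the incoming sample."""
        k = min(batch_size, len(self.items))
        return self.rng.sample(self.items, k)

buffer = ReservoirReplayBuffer(capacity=5)
for step in range(100):                    # a stream of 100 toy samples
    buffer.add(step)
    replay_batch = buffer.sample(3)        # would be mixed into the training step
```

The stream is visited only once, yet old samples keep reappearing in `replay_batch`, which is the re-occurrence of past input that replay-based approaches rely on to limit forgetting.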
Human detection in real time with thermal camera using drones
We live in a world where technology predominates and advances very quickly, producing applications and devices that help us solve everyday problems. One technology in full development is the drone, also called RPAS (Remotely Piloted Aircraft System). These unmanned aerial vehicles offer countless applications that address such everyday needs. Another technology in full development, and one with great potential for building applications, is Artificial Intelligence (AI). Briefly, AI is the intelligence that humans develop with body, brain and mind, but expressed by a machine, processor and software. Because drones can fly over places that humans cannot reach, and their cameras let us see what is in those areas, one of their most useful applications is Search and Rescue. The main focus of this project is to merge RPAS with AI to develop a Search and Rescue application. To this end, executable software that runs on any computer has been created to detect people lost in the forest or other places from video, either a YouTube stream or a locally saved file. Detection therefore works both in real time and in deferred time; the machine performs the detection, leaving the human operator free for other tasks while the drone carries out the search. The objective is to use both the thermal and the visual cameras of a drone to record video or stream images to the software, which uses Artificial Intelligence to detect a person and sound an alarm when one is found. The software has been developed in Python, an open-source, cross-platform programming language that can be used for web development, software creation, and data processing.
This language is particularly useful because it is one of the most widely used in programming, so many libraries are available, such as OpenCV, created by the open-source community, which have made this human detection software possible. Developing the program required training the machine with images of people. These images were obtained during real flights in the Collserola area of Barcelona. This area is close to Barcelona El Prat airport, so permissions and coordination are needed to fly legally. For this reason, the project also covers all the documentation and legal requirements necessary to fly in the Barcelona area.
Word-Level Representation From Bytes For Language Modeling
Modern language models mostly take sub-words as input, a design that balances
the trade-off between vocabulary size, number of parameters, and performance.
However, sub-word tokenization still has disadvantages, such as a lack of
robustness to noise and difficulty generalizing to new languages. Also, the
current trend of scaling up models reveals that larger models require larger
embeddings, which makes parallelization hard. Previous work on image
classification proves that splitting raw input into a sequence of chunks is a strong, model-agnostic
inductive bias. Based on this observation, we rethink the existing
character-aware method that takes character-level inputs but performs word-level
sequence modeling and prediction. We overhaul this method by introducing a
cross-attention network that builds word-level representation directly from
bytes, and a sub-word level prediction based on word-level hidden states to
avoid the time and space requirement of word-level prediction. With these two
improvements combined, we have a token-free model with slim input embeddings
for downstream tasks. We name our method Byte2Word and perform evaluations on
language modeling and text classification. Experiments show that Byte2Word is
on par with the strong sub-word baseline BERT but takes up only 10% of the
embedding size. We further test our method on synthetic noise and cross-lingual
transfer and find it competitive with baseline methods in both settings.
Comment: preprint