Visual Pivoting for (Unsupervised) Entity Alignment
This work studies the use of visual semantic representations to align
entities in heterogeneous knowledge graphs (KGs). Images are natural components
of many existing KGs. By combining visual knowledge with other auxiliary
information, we show that the proposed new approach, EVA, creates a holistic
entity representation that provides strong signals for cross-graph entity
alignment. Moreover, previous entity alignment methods require human-labelled
seed alignments, which limits their applicability. EVA provides a completely
unsupervised solution by leveraging the visual similarity of entities to create
an initial seed dictionary (visual pivots). Experiments on benchmark data sets
DBP15k and DWY15k show that EVA offers state-of-the-art performance on both
monolingual and cross-lingual entity alignment tasks. Furthermore, we discover
that images are particularly useful to align long-tail KG entities, which
inherently lack the structural contexts necessary for capturing the
correspondences.
Comment: To appear at AAAI-202
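The abstract does not spell out how the visual pivots are selected, so the following is only a minimal sketch of one plausible reading: treat each entity's image as a feature vector and keep the cross-graph pairs that are mutual nearest neighbours under cosine similarity. The function names, the `min_similarity` threshold, and the dictionary-of-vectors input format are all assumptions made for illustration, not EVA's actual procedure.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def visual_pivots(images_a, images_b, min_similarity=0.0):
    """Return sorted (entity_a, entity_b) pairs that are mutual nearest neighbours.

    images_a, images_b: dicts mapping an entity name to its image feature
    vector (hypothetical input format). min_similarity is an illustrative
    cut-off, not a parameter taken from the paper.
    """
    if not images_a or not images_b:
        return []
    # Nearest neighbour in graph B for every entity of graph A.
    best_for_a = {
        a: max(images_b, key=lambda b: cosine(vec_a, images_b[b]))
        for a, vec_a in images_a.items()
    }
    # Nearest neighbour in graph A for every entity of graph B.
    best_for_b = {
        b: max(images_a, key=lambda a: cosine(images_a[a], vec_b))
        for b, vec_b in images_b.items()
    }
    pairs = []
    for a, b in best_for_a.items():
        # Keep the pair only if the choice is reciprocal and similar enough.
        if best_for_b[b] == a and cosine(images_a[a], images_b[b]) >= min_similarity:
            pairs.append((a, b))
    return sorted(pairs)
```

Requiring reciprocity is a design choice that trades recall for precision: an entity whose closest image match prefers some other entity is left out of the seed dictionary rather than risk a wrong pivot.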
Encoder-Decoder Based Long Short-Term Memory (LSTM) Model for Video Captioning
This work demonstrates the implementation and use of an encoder-decoder model to perform a many-to-many mapping of video data to text captions. The mapping takes an input temporal sequence of video frames to an output sequence of words that forms a caption sentence. Data preprocessing, model construction, and model training are discussed. Caption correctness is evaluated using 2-gram BLEU scores across the different splits of the dataset. Specific examples of output captions are shown to demonstrate model generality over the video temporal dimension. Predicted captions are shown to generalize over video action, even in instances where the video scene changes dramatically. Model architecture changes that could improve sentence grammar and correctness are discussed.
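As an illustration of the evaluation metric, the sketch below computes a sentence-level BLEU score up to 2-grams. It assumes that "2-gram BLEU" means the cumulative score with uniform weights over unigram and bigram precision, a single reference caption, and no smoothing; the abstract does not state which variant or tooling was actually used, and the function names are illustrative.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def modified_precision(candidate, reference, n):
    """Clipped n-gram precision of the candidate against one reference."""
    cand_counts = Counter(ngrams(candidate, n))
    ref_counts = Counter(ngrams(reference, n))
    total = sum(cand_counts.values())
    if total == 0:
        return 0.0
    # Each candidate n-gram is credited at most as often as it occurs in the reference.
    clipped = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
    return clipped / total


def bleu2(candidate, reference):
    """Sentence-level BLEU over 1-grams and 2-grams (assumed variant, no smoothing)."""
    if not candidate:
        return 0.0
    p1 = modified_precision(candidate, reference, 1)
    p2 = modified_precision(candidate, reference, 2)
    if p1 == 0.0 or p2 == 0.0:
        return 0.0
    c, r = len(candidate), len(reference)
    # Brevity penalty: captions shorter than the reference are penalised.
    brevity_penalty = 1.0 if c > r else math.exp(1.0 - r / c)
    # Geometric mean of the two precisions, scaled by the brevity penalty.
    return brevity_penalty * math.exp(0.5 * math.log(p1) + 0.5 * math.log(p2))
```

For example, scoring the candidate "a man is playing a guitar" against the reference "a man is playing the guitar" gives a unigram precision of 5/6 and a bigram precision of 3/5, so the score is the square root of 0.5, roughly 0.71.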
Deep Learning Based Video Captioning through Encoder-Decoder Based Long Short-Term Memory (LSTM)
This work demonstrates the implementation and use of an encoder-decoder model to perform a many-to-many mapping of video data to text captions. The mapping takes an input temporal sequence of video frames to an output sequence of words that forms a caption sentence. Data preprocessing, model construction, and model training are discussed. Caption correctness is evaluated using 2-gram BLEU scores across the different splits of the dataset. Specific examples of output captions are shown to demonstrate model generality over the video temporal dimension. Predicted captions are shown to generalize over video action, even in instances where the video scene changes dramatically. Model architecture changes that could improve sentence grammar and correctness are discussed.
Captioning Deep Learning Based Encoder-Decoder through Long Short-Term Memory (LSTM)
This work demonstrates the implementation and use of an encoder-decoder model to perform a many-to-many mapping of video data to text captions. The mapping takes an input temporal sequence of video frames to an output sequence of words that forms a caption sentence. Data preprocessing, model construction, and model training are discussed. Caption correctness is evaluated using 2-gram BLEU scores across the different splits of the dataset. Specific examples of output captions are shown to demonstrate model generality over the video temporal dimension. Predicted captions are shown to generalize over video action, even in instances where the video scene changes dramatically. Model architecture changes that could improve sentence grammar and correctness are discussed.
Visual grounding in video for unsupervised word translation
There are thousands of actively spoken languages on Earth, but a single visual world. Grounding in this visual world has the potential to bridge the gap between all these languages. Our goal is to use visual grounding to improve unsupervised word mapping between languages. The key idea is to establish a common visual representation between two languages by learning embeddings from unpaired instructional videos narrated in the native language. Given this shared embedding, we demonstrate that (i) we can map words between the languages, particularly the 'visual' words; (ii) the shared embedding provides a good initialization for existing unsupervised text-based word translation techniques, forming the basis for our proposed hybrid visual-text mapping algorithm, MUVE; and (iii) our approach achieves superior performance by addressing the shortcomings of text-based methods - it is more robust, handles datasets with less commonality, and is applicable to low-resource languages. We apply these methods to translate words from English to French, Korean, and Japanese - all without any parallel corpora and simply by watching many videos of people speaking while doing things.